Vendor Recommender - EDA

@olibolly

Open TO-DO

  • Link the notebook with github using ungit - DONE
  • Provide access to the project if we go for Big query - DONE
  • Re-pull EDA using updated 2016-2017 data - DONE
  • Further EDA on collaborative filtering - DONE
  • Run first regression to understand what features matter - DONE
  • Join tables FAPIIS and USA spending

Dataset

  • USASpending.gov available on BigQuery dataset (17 years of data, 45mn transactions, $6.7tn worth of goods and services): gpqueries:contracts
  • Past Performance Information Retrieval System (PPIRS) -> review - not public data

    • System for Award Management (SAM)
    • FAPIIS
  • Are they any other dataset we should be considering??

Table gpqueries:contracts.raw

Table gpqueries:contracts.raw contains the unmodified data from the USASpending.gov archives. It's constructed from <year>_All_Contracts_Full_20160515.csv.zip files and includes contracts from 2000 to May 15, 2016.

Table gpqueries:contracts.raw contains 45M rows and 225 columns.

Each row refers to a transaction (a purchase or refund) made by a federal agency. It may be a pizza or an airplane.

The columns are grouped into categories:

  • Transaction: unique_transaction_id-baseandalloptionsvalue
  • Buyer (government agency): maj_agency_cat-fundedbyforeignentity
  • Dates: signeddate-lastdatetoorder, last_modified_date
  • Contract: contractactiontype-programacronym
  • Contractor (supplier, vendor): vendorname-statecode
  • Place of performance: PlaceofPerformanceCity-placeofperformancecongressionaldistrict
  • Product or service bought: psc_cat-manufacturingorganizationtype
  • General contract information: agencyid-idvmodificationnumber
  • Competitive procedure: solicitationid-statutoryexceptiontofairopportunity
  • Contractor details: organizationaltype-otherstatutoryauthority
  • Contractor's executives: prime_awardee_executive1-interagencycontractingauthority

Detailed description for each variable is available in the official codebook:


In [1]:
import google.datalab.bigquery as bq
import pandas as pd
import matplotlib.pyplot as plt

import numpy as np
import scipy as sp
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

from sklearn import cross_validation as cv
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics import mean_squared_error
from math import sqrt

In [2]:
%%sql
select * from [fiery-set-171213:vrec.sam_exclusions] limit 5


Out[2]:
ClassificationNamePrefixFirstMiddleLastSuffixAddress_1Address_2Address_3Address_4CityState___ProvinceCountryZip_CodeDUNSExclusion_ProgramExcluding_AgencyCT_CodeExclusion_TypeAdditional_CommentsActive_DateTermination_DateRecord_StatusCross_ReferenceSAM_NumberCAGENPI
DAMASOLOPEZ NUNEZCULIACAN, SINALOAMEX80450 S4MR47FDZ 
ZAKHARY KNYAZEVICHKALASHOVTBILISIGEO S4MR479RJ 
FILEMONGARCIA AYALAMATAMOROS, TAMAULIPASMEX S4MR47VPQ 
MUSAKALIMQUETTAPAK S4MR473KX 
LOS RASTROJOSECU S4MR47QK3 

(rows: 5, time: 0.3s, cached, job: job_AlEbC7YCG3b0GVDiHeYlL06ikkg)

In [4]:
%%sql
select Exclusion_Type from [fiery-set-171213:vrec.sam_exclusions] group by 1;


Out[4]:
Exclusion_Type
Ineligible (Proceedings Completed)
Voluntary Exclusion
Ineligible (Proceedings Pending)
Prohibition/Restriction

(rows: 5, time: 0.4s, 4MB processed, job: job_Pf6PhpTQ2YepmOgiqLWwQbdCR6g)

In [5]:
%%sql
select Classification from [fiery-set-171213:vrec.sam_exclusions] group by 1;


Out[5]:
Classification
Individual
Special Entity Designation
Firm
Vessel

(rows: 5, time: 0.4s, 1MB processed, job: job_ySnKks1ZSIgi66QF4mfYEKeoXKs)

In [6]:
%%sql
select
count(*)
from [fiery-set-171213:vrec.sam_exclusions] 
where Classification in ('Firm')
;


Out[6]:
f0_
8659

(rows: 1, time: 0.4s, 1MB processed, job: job_RYz4DZDpee5MvA6orEHhXQS4y-4)

There are 8,659 firms on the SAM exclusion list


In [7]:
%%bq query -n df_query
select
EXTRACT(YEAR FROM Active_Date) as year,
count(*) as count
from `fiery-set-171213.vrec.sam_exclusions`
where Classification in ('Firm')
and Active_Date is not NULL
group by 1
order by 1;

In [8]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()

ax = df.plot(kind='bar', x='year', title='Excluded Firms per year', figsize=(15,8))
ax.set_xlabel('Year')
ax.set_ylabel('count')


Out[8]:
<matplotlib.text.Text at 0x7f304008ef90>

In [9]:
%%bq query
select
#Name,
SAM_Number,
count(*) as count
from `fiery-set-171213.vrec.sam_exclusions`
where Classification in ('Firm')
#and Active_Date is not NULL
group by 1
order by 2 DESC
limit 5;


Out[9]:
SAM_Numbercount
S4MR4D26H6
S4MR1SHZ24
S4MR3MGDX3
S4MR3MKV53
S4MR3MQNF3

(rows: 5, time: 3.6s, 3MB processed, job: job_4GH4nq7hyaYir9EEBcZ34MV0j9w)

In [10]:
%%bq query
select
NPI,
count(*) as count
from `fiery-set-171213.vrec.sam_exclusions`
where Classification in ('Firm')
#and CAGE is not NULL
group by 1
order by 2 DESC
limit 5;


Out[10]:
NPIcount
 8496
07
12159688474
14373332594
10837460693

(rows: 5, time: 1.1s, 1MB processed, job: job_ZVarq8hj2tCJ3usoBuRsvFCvI0E)

In [11]:
%%bq query
select
CAGE,
count(*) as count
from `fiery-set-171213.vrec.sam_exclusions`
where Classification in ('Firm')
#and CAGE is not NULL
group by 1
order by 2 DESC
limit 5;


Out[11]:
CAGEcount
8102
0TEF44
4LD154
1YEC54
5S6L24

(rows: 5, time: 7.8s, 2MB processed, job: job_S87HKiSB7xRCVzEq2v1cYJBfoMk)

NPI and CAGE don't seem to be great keys to join the data - ideally we can use SAM

Federal Awardee Performance and Integrity Information System (FAPIIS)

  • This is the contractor's fault - you can do business with these contractors on SAM one cannot do business with
  • Only 5 years by design

In [12]:
%%bq query
select * 
from `fiery-set-171213.vrec.fapiis` 
limit 5


Out[12]:
AGENCYRECORD_TYPERECORD_DATECONTRACT_GRANT_NUMBERREFERENCED_NUMBERSOLICITATION_FUNDING_NUMBERDUNSCAGEAWARDEE_NAMEADDRESS1ADDRESS2CITYSTATEZIP_CODECOUNTRYVIOLATION_AGAINSTSUBCONTRACTOR_DUNSSUBCONTRACTOR_CAGESUBCONTRACTOR_NAMESUBCONTRACTOR_ADDRESS1SUBCONTRACTOR_ADDRESS2SUBCONTRACTOR_CITYSUBCONTRACTOR_STATESUBCONTRACTOR_ZIP_CODESUBCONTRACTOR_COUNTRYFSCNAICSAGENCY_CODECODE_2OFFICIAL_NAMEOFFICIAL_PHONEOFFICIAL_EMAILTERMINATION_TYPETERMINATION_DATEUSEREMAILPHONECOMMENTS
STATE, DEPARTMENT OFTermination for Material Failure to Comply2016-06-29SWA80014GR035  539146515 LUDWIG NDINDA PRIMARY PLAYGROUND PROJECTP.O. BOX 15 OKAKARARA  NAM             1900KRISTEN ERTHUM202-647-7915erthumk@state.govN/A     
DEPT OF THE NAVYDoD Determination of Contractor Fault2014-01-21M6261313P0177  12820064 NASSAU CABLE CORP150 GREAT NECK RD STE 305 GREAT NECKNY11021US           58953314201700KIM SAKURA HIGA81989702577sakura.higa@usmc.milN/A     
STATE, DEPARTMENT OFTermination for Material Failure to Comply2016-02-08SCE20012GR094  659427269 ORGANIZATION FOR HABITATION AND RESOURCES DEVELOPMENT258, 18, RAILWAY STATION ROAD, THILLAYADI PUTTALAM  LKA             1900ELIADAH COE202-647-9544coeem@state.govN/A     
INTERIOR, DEPARTMENT OF THESubcontractor Payment Issues2015-11-05INR15PX00625  44271201 D T M ASSOCIATES LIMITED LIABILITY COMPANY1200 PENNSYLVANIA AVE MCDONOUGHGA302538999USA           37503331121400LYNETTE ROCK801-524-3798lrock@usbr.govN/A     
DEPT OF THE NAVYDoD Determination of Contractor Fault2017-01-09N6883616P1114  798628587E6E6OBELLO INC7604 E 117TH TER KANSAS CITYMO641343925USA           D3995179191700DAVID DIAZ912-573-2656david.diaz1@navy.milN/A     

(rows: 5, time: 1.3s, 943KB processed, job: job_aRVGiTVZ6jOJTHeAr_gzfb_x__Q)

In [13]:
%%bq query -n df_query
select
EXTRACT(YEAR FROM RECORD_DATE) as year,
count(*) as count
from `fiery-set-171213.vrec.fapiis`
group by 1
order by 1;

In [14]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()

ax = df.plot(kind='bar', x='year', title='Firms by Record date', figsize=(10,5))
ax.set_xlabel('Year')
ax.set_ylabel('count')


Out[14]:
<matplotlib.text.Text at 0x7f3033c05f50>

In [15]:
%%bq query -n df_query
select
EXTRACT(YEAR FROM TERMINATION_DATE) as year,
count(*) as count
from `fiery-set-171213.vrec.fapiis`
group by 1
order by 1;

In [16]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()

ax = df.plot(kind='bar', x='year', title='Firms by termination date', figsize=(10,5))
ax.set_xlabel('Year')
ax.set_ylabel('count')


Out[16]:
<matplotlib.text.Text at 0x7f3033c13850>

In [17]:
%%bq query
select
AWARDEE_NAME,
DUNS,
count(*) as count
from `fiery-set-171213.vrec.fapiis`
group by 1,2
order by 3 DESC
limit 5;


Out[17]:
AWARDEE_NAMEDUNScount
ALPHA RAPID ENGINEERING SOLUTIONS96881849796
SDV RECON INC82847949872
GIESTING & ASSOCIATES, INC.6893984241
UNITED PROCUREMENT, L.P.62803634531
EPITECH, INC.19037862531

(rows: 5, time: 1.2s, 101KB processed, job: job_s4zudvnCBv32DSETuG1Vb1HoCGg)

In [18]:
%%bq query
select
*
from `fiery-set-171213.vrec.fapiis`
where AWARDEE_NAME in ('ALPHA RAPID ENGINEERING SOLUTIONS')
limit 5;


Out[18]:
AGENCYRECORD_TYPERECORD_DATECONTRACT_GRANT_NUMBERREFERENCED_NUMBERSOLICITATION_FUNDING_NUMBERDUNSCAGEAWARDEE_NAMEADDRESS1ADDRESS2CITYSTATEZIP_CODECOUNTRYVIOLATION_AGAINSTSUBCONTRACTOR_DUNSSUBCONTRACTOR_CAGESUBCONTRACTOR_NAMESUBCONTRACTOR_ADDRESS1SUBCONTRACTOR_ADDRESS2SUBCONTRACTOR_CITYSUBCONTRACTOR_STATESUBCONTRACTOR_ZIP_CODESUBCONTRACTOR_COUNTRYFSCNAICSAGENCY_CODECODE_2OFFICIAL_NAMEOFFICIAL_PHONEOFFICIAL_EMAILTERMINATION_TYPETERMINATION_DATEUSEREMAILPHONECOMMENTS
DEFENSE COMMISSARY AGENCY (DECA)Termination for Cause2015-09-04HDEC0413D00170257  9688184976GLG3ALPHA RAPID ENGINEERING SOLUTIONS3159 SHRADER RD DOVERNJ7801USA           742033331897AZDESSIE MILLS8047348000dessie.mills@deca.milComplete     
DEFENSE COMMISSARY AGENCY (DECA)Termination for Cause2015-09-04HDEC0413D00170259  9688184976GLG3ALPHA RAPID ENGINEERING SOLUTIONS3159 SHRADER RD DOVERNJ7801USA           742033331897AZDESSIE MILLS8047348000dessie.mills@deca.milComplete     
DEFENSE COMMISSARY AGENCY (DECA)Termination for Cause2015-09-04HDEC0413D00170260  9688184976GLG3ALPHA RAPID ENGINEERING SOLUTIONS3159 SHRADER RD DOVERNJ7801USA           742033331897AZDESSIE MILLS8047348000dessie.mills@deca.milComplete     
DEFENSE COMMISSARY AGENCY (DECA)Termination for Cause2015-09-02268HDEC0413D0017 9688184976GLG3ALPHA RAPID ENGINEERING SOLUTIONS3159 SHRADER RD DOVERNJ78015735USA           742033331897AZDESSIE MILLS8047348000dessie.mills@deca.milComplete     
DEFENSE COMMISSARY AGENCY (DECA)Termination for Cause2015-09-04HDEC0413D00170258  9688184976GLG3ALPHA RAPID ENGINEERING SOLUTIONS3159 SHRADER RD DOVERNJ7801USA           742033331897AZDESSIE MILLS8047348000dessie.mills@deca.milComplete     

(rows: 5, time: 1.7s, 943KB processed, job: job_sGxn_-8wDhf9VNVCfpJ1V-yzcqY)

In [19]:
%%bq query
select
RECORD_TYPE,
count(*) as count
from `fiery-set-171213.vrec.fapiis`
group by 1
order by 2 DESC


Out[19]:
RECORD_TYPEcount
Termination for Cause1603
Termination for Default1182
Administrative Agreement158
Non-Responsibility Determination48
Termination for Material Failure to Comply8
DoD Determination of Contractor Fault2
Subcontractor Payment Issues1

(rows: 7, time: 1.4s, 72KB processed, job: job_CUUCrloWjHkOFNz5rBo1EQSgFQc)

FAPIIS is not bad with 3002 DUNS code but time range goes only from 2012 to 2017

USA Spending


In [3]:
%%bq query -n df_query
select count(*) as transactions
from `fiery-set-171213.vrec.usa_spending_all`

In [4]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
df.head()


Out[4]:
transactions
0 49454646

In [22]:
%%bq query
select *
from `fiery-set-171213.vrec.usa_spending_all`
where mod_agency in ('1700: DEPT OF THE NAVY')
limit 5


Out[22]:
unique_transaction_idtransaction_statusdollarsobligatedbaseandexercisedoptionsvaluebaseandalloptionsvaluemaj_agency_catmod_agencymaj_fund_agency_catcontractingofficeagencyidcontractingofficeidfundingrequestingagencyidfundingrequestingofficeidfundedbyforeignentitysigneddateeffectivedatecurrentcompletiondateultimatecompletiondatelastdatetoordercontractactiontypereasonformodificationtypeofcontractpricingpriceevaluationpercentdifferencesubcontractplanlettercontractmultiyearcontractperformancebasedservicecontractmajorprogramcodecontingencyhumanitarianpeacekeepingoperationcontractfinancingcostorpricingdatacostaccountingstandardsclausedescriptionofcontractrequirementpurchasecardaspaymentmethodnumberofactionsnationalinterestactioncodeprogsourceagencyprogsourceaccountprogsourcesubacctaccount_titlerec_flagtypeofidcmultipleorsingleawardidcprogramacronymvendornamevendoralternatenamevendorlegalorganizationnamevendordoingasbusinessnamedivisionnamedivisionnumberorofficecodevendorenabledvendorlocationdisableflagccrexceptionstreetaddressstreetaddress2streetaddress3citystatezipcodevendorcountrycodevendor_state_codevendor_cdcongressionaldistrictvendorsitecodevendoralternatesitecodedunsnumberparentdunsnumberphonenofaxnoregistrationdaterenewaldatemod_parentlocationcodestatecodePlaceofPerformanceCitypop_state_codeplaceofperformancecountrycodeplaceofperformancezipcodepop_cdplaceofperformancecongressionaldistrictpsc_catproductorservicecodesystemequipmentcodeclaimantprogramcodeprincipalnaicscodeinformationtechnologycommercialitemcategorygfe_gfpuseofepadesignatedproductsrecoveredmaterialclausesseatransportationcontractbundlingconsolidatedcontractcountryoforiginplaceofmanufacturemanufacturingorganizationtypeagencyidpiidmodnumbertransactionnumberfiscal_yearidvagencyididvpiididvmodificationnumbersolicitationidextentcompetedreasonnotcompetednumberofoffersreceivedcommercialitemacquisitionprocedurescommercialitemtestprogramsmallbusinesscompetitivenessdemonstrationprograma76actioncompetitiveproceduressolicitationprocedurestypeofsetasidelocalareasetasideevaluatedpreferencefedbizoppsresearchstatutoryexceptiontofairopportunityorganizationaltypenumberofemployeesannualrevenuefirm8aflaghubzoneflagsdbflagissbacertifiedsmalldisadvantagedbusinessshelteredworkshopflaghbcuflageducationalinstitutionflagwomenownedflagveteranownedflagsrdvobflaglocalgovernmentflagminorityinstitutionflagaiobflagstategovernmentflagfederalgovernmentflagminorityownedbusinessflagapaobflagtribalgovernmentflagbaobflagnaobflagsaaobflagnonprofitorganizationflagisothernotforprofitorganizationisforprofitorganizationisfoundationhaobflagishispanicservicinginstitutionemergingsmallbusinessflaghospitalflagcontractingofficerbusinesssizedeterminationis1862landgrantcollegeis1890landgrantcollegeis1994landgrantcollegeisveterinarycollegeisveterinaryhospitalisprivateuniversityorcollegeisschoolofforestryisstatecontrolledinstitutionofhigherlearningisserviceproviderreceivescontractsreceivesgrantsreceivescontractsandgrantsisairportauthorityiscouncilofgovernmentsishousingauthoritiespublicortribalisinterstateentityisplanningcommissionisportauthorityistransitauthorityissubchapterscorporationislimitedliabilitycorporationisforeignownedandlocatedisarchitectureandengineeringisdotcertifieddisadvantagedbusinessenterpriseiscitylocalgovernmentiscommunitydevelopedcorporationownedfirmiscommunitydevelopmentcorporationisconstructionfirmismanufacturerofgoodsiscorporateentitynottaxexemptiscountylocalgovernmentisdomesticshelterisfederalgovernmentagencyisfederallyfundedresearchanddevelopmentcorpisforeigngovernmentisindiantribeisintermunicipallocalgovernmentisinternationalorganizationislaborsurplusareafirmislocalgovernmentownedismunicipalitylocalgovernmentisnativehawaiianownedorganizationorfirmisotherbusinessororganizationisotherminorityownedispartnershiporlimitedliabilitypartnershipisschooldistrictlocalgovernmentissmallagriculturalcooperativeissoleproprietorshipistownshiplocalgovernmentistriballyownedfirmistribalcollegeisalaskannativeownedcorporationorfirmiscorporateentitytaxexemptiswomenownedsmallbusinessisecondisadvwomenownedsmallbusinessisjointventurewomenownedsmallbusinessisjointventureecondisadvwomenownedsmallbusinesswalshhealyactservicecontractactdavisbaconactclingercohenactotherstatutoryauthorityprime_awardee_executive1prime_awardee_executive1_compensationprime_awardee_executive2prime_awardee_executive2_compensationprime_awardee_executive3prime_awardee_executive3_compensationprime_awardee_executive4prime_awardee_executive4_compensationprime_awardee_executive5prime_awardee_executive5_compensationinteragencycontractingauthoritylast_modified_date
f6020a9fe79d606e60f83102eb4ce222active0.00.0-684.09700: DEPT OF DEFENSE1700: DEPT OF THE NAVY:1700: DEPT OF THE NAVYN62467: NAVY FACILITIES ENGINEERING COMMAND::Not Applicable09/30/200209/30/200201/01/190001/01/190001/01/1900IDC Indefinite Delivery Contract:NO: Not Reported :XN: NoN: NO - SERVICE WHERE PBA IS NOT USED.:::X::S: SINGLE AWARDNAVALES ENTERPRISES INCORPORATED115 EAST TRAVIS STREET SUITE 1018SAN ANTONIOTX782051611UNITED STATESTXTX20TX201545647280000PR78205161115456472815456472812/04/200108/10/2005NAVALES ENTERPRISES INC.::::::::N: Transaction does not use GFE/GFP:::D: NOT A BUNDLED REQUIREMENTN: No::9700: DEPT OF DEFENSEN6246701D3212P000032002B: NOT AVAILABLE FOR COMPETITIONOTH: AUTHORIZED BY STATUTE1D: N: NoN: NONoNP: NEGOTIATED PROPOSAL/QUOTENONE: NO SET ASIDE USED.NONE: NO PREFERENCE USEDX803500000.0NYNNNNNYNNNNNNNYNNNNNNNNNYNYNS: SMALL BUSINESSNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNX: Not ApplicableY: YesX: Not ApplicableN: No0.00.00.00.00.0X: Not Applicable09/23/2006
c5f34300d68a74b07c2f6fea94417767active0.00.00.09700: DEPT OF DEFENSE1700: DEPT OF THE NAVY:1700: DEPT OF THE NAVYC2472: NAVAL FACILITIES ENGINEERING COM::Not Applicable09/30/200209/30/200201/01/190001/01/190012/15/2002IDC Indefinite Delivery Contract:J: Firm Fixed Price B: PLAN NOT REQUIRED XN: NoN: NO - SERVICE WHERE PBA IS NOT USED.::N: NoX:NONE: NoneS: SINGLE AWARDOAK ENVIRONMENTAL CONSULTANTS,GREENTREE MEWSMARLTONNJ08053UNITED STATESNJNJ03NJ0380957137580957137501/01/190001/01/1900THE OAK GROUP INC:::ZZ161: MAINT-REP-ALT/FAMILY HOUSING:C2: CONSTRUCTION233220: MULTIFAMILY HOUSING CONSTRUCTION:N: Transaction does not use GFE/GFP::U: UNKNOWND: NOT A BUNDLED REQUIREMENTN: No::9700: DEPT OF DEFENSEN6247200D695902002A: FULL AND OPEN COMPETITION:4D: N: NoY: YESNo:NONE: NO SET ASIDE USED.:X01000000.0NNNNNNNNNNNNNNNNNNNNNNNNNNNNNS: SMALL BUSINESSNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNY: YesX: Not ApplicableX: Not ApplicableN: No0.00.00.00.00.0X: Not Applicable11/16/2008
c18e540e266ef038ee5057dedafa09ebactive0.00.00.09700: DEPT OF DEFENSE1700: DEPT OF THE NAVY:1700: DEPT OF THE NAVYC2836: FAR EAST CONTRACTS OFFICE::Not Applicable04/26/200204/26/200201/01/190001/01/190008/15/2002IDC Indefinite Delivery Contract:J: Firm Fixed Price B: PLAN NOT REQUIRED XN: NoX: NOT APPLICABLE::N: NoX:NONE: NoneS: SINGLE AWARDTOYOTA MOTOR CORP.1-4-18, KORAKUBUNKYO-KUJA112-8JA69123986769056473701/01/190001/01/1900TOYOTA MOTOR CORPORATION:::232310: PASSENGER MOTOR VEHICLES:C2: CONSTRUCTION441229: ALL OTHER MOTOR VEHICLE DEALERS:N: Transaction does not use GFE/GFP:A: FAR 52.223-4 INCLUDEDU: UNKNOWND: NOT A BUNDLED REQUIREMENTN: No::9700: DEPT OF DEFENSEN6283602D220602002A: FULL AND OPEN COMPETITION:5D: N: NoN: NONo:NONE: NO SET ASIDE USED.:X00.0NNNNNNNNNNNNNNNNNNNNNNNNNNNNNO: OTHER THAN SMALL BUSINESSNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNX: Not ApplicableX: Not ApplicableX: Not ApplicableN: No0.00.00.00.00.0X: Not Applicable11/16/2008
9e6f87573e066a3b40278ccd5e0257a7active0.00.050000000.09700: DEPT OF DEFENSE1700: DEPT OF THE NAVY:1700: DEPT OF THE NAVYC2477: NAVAL FACILITIES ENGINEERING COM::Not Applicable10/01/200110/01/200101/01/190001/01/190007/15/2003IDC Indefinite Delivery Contract:J: Firm Fixed Price B: PLAN NOT REQUIRED XN: NoN: NO - SERVICE WHERE PBA IS NOT USED.::Y: YesX:NONE: NoneM: MULTIPLE AWARDVITERI CONSTRUCTION MANAGEMENT11841 CANON BLVDNEWPORT NEWSVA23606UNITED STATESVAVA01VA0160589882460589882401/01/190001/01/1900VITERI CONSTRUCTION MANAGEMENT INC.:::YY119: CONSTRUCT/OTHER ADMIN & SVCS BLDGS:C2: CONSTRUCTION233320: COMMERCIAL AND INSTITUTIONAL BUILDING CONSTRUCTION:N: Transaction does not use GFE/GFP::U: UNKNOWND: NOT A BUNDLED REQUIREMENTN: No::9700: DEPT OF DEFENSEN6247700D021202002D: FULL AND OPEN COMPETITION AFTER EXCLUSION OF SOURCES:24D: N: NoY: YESNo:8A: 8A COMPETED:X02000000.0NNYYNNNNNNNNNNNNNNNNNNNNNNNNNS: SMALL BUSINESSNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNX: Not ApplicableX: Not ApplicableY: YesN: No0.00.00.00.00.0X: Not Applicable02/12/2006
e9f432c59b29f480d1da3f4dd83f9311active0.00.00.09700: DEPT OF DEFENSE1700: DEPT OF THE NAVY:1700: DEPT OF THE NAVYNJ30A: NAVAL REGIONAL CONTRACTING CENTE::Not Applicable03/26/200303/26/200301/01/190001/01/190004/15/2003IDC Indefinite Delivery Contract:J: Firm Fixed Price B: PLAN NOT REQUIRED XN: NoX: NOT APPLICABLE::N: NoX:NONE: NoneS: SINGLE AWARDKAMAL M AL SULTAN COP O BOX 665SHUWAIKH INDUSKUKU53515237553515237501/01/190001/01/1900KAMAL M AL SULTAN CO:::898945: FOOD, OILS AND FATS:C2: CONSTRUCTION424410: GENERAL LINE GROCERY MERCHANT WHOLESALERS:N: Transaction does not use GFE/GFP::N: NOD: NOT A BUNDLED REQUIREMENTN: No::9700: DEPT OF DEFENSEN4940003AE42502003A: FULL AND OPEN COMPETITION:3A: N: NoN: NONoSP1: SIMPLIFIED ACQUISITIONNONE: NO SET ASIDE USED.:X00.0NNNNNNNNNNNNNNNNNNNNNNNNNNNNNO: OTHER THAN SMALL BUSINESSNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNX: Not ApplicableX: Not ApplicableX: Not ApplicableN: No0.00.00.00.00.0X: Not Applicable11/16/2008

(rows: 5, time: 1.6s, 93GB processed, job: job_nBwQHKanq9PbypVCsB7JEXHY3GY)

In [23]:
%%bq query -n df_query
select
#substr(signeddate, 1, 2) month,
fiscal_year as year,  
count(*) transactions,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
group by year
order by year asc

In [24]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
ax = df.set_index('year')['dollarsobligated'].plot(kind='bar', title='Government purchases by years')
ax.set_ylabel('dollars obligated')


Out[24]:
<matplotlib.text.Text at 0x7f30321c8f90>

In [25]:
%%bq query -n df_query
select
fiscal_year as year,  
sum(dollarsobligated)/count(*) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
group by year
order by year asc

In [26]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
ax = df.set_index('year')['dollarsobligated'].plot(kind='bar', title='avg. transaction size by years')
ax.set_ylabel('dollars obligated')


Out[26]:
<matplotlib.text.Text at 0x7f303219b410>

Which means we're dealing with 49.5M transactions totalling 6.7 trillion dollars. These purchases came from 622k vendors that won 2.2mn solicitations issued by government agencies.


In [27]:
%%bq query 
select
maj_agency_cat,
mod_agency,
count(*)
from `fiery-set-171213.vrec.usa_spending_all`
group by 1,2
order by 3 DESC
limit 20


Out[27]:
maj_agency_catmod_agencyf0_
9700: DEPT OF DEFENSE97AS: DEFENSE LOGISTICS AGENCY5829407
4700: GENERAL SERVICES ADMINISTRATION4730: FEDERAL ACQUISITION SERVICE5001861
3600: VETERANS AFFAIRS, DEPARTMENT OF3600: VETERANS AFFAIRS, DEPARTMENT OF4956277
9700: DEPT OF DEFENSE2100: DEPT OF THE ARMY4338326
9700: Department of Defense3663469
9700: DEPT OF DEFENSE1700: DEPT OF THE NAVY3561991
9700: Department of Defense97AS: DEFENSE LOGISTICS AGENCY3470140
9700: DEPT OF DEFENSE5700: DEPT OF THE AIR FORCE1701478
4700: GENERAL SERVICES ADMINISTRATION4732: FEDERAL ACQUISITION SERVICE1444491
4700: GENERAL SERVICES ADMINISTRATION4740: PUBLIC BUILDINGS SERVICE1436683
1500: JUSTICE, DEPARTMENT OF1540: FEDERAL PRISON SYSTEM895777
1900: STATE, DEPARTMENT OF1900: STATE, DEPARTMENT OF767320
9700: Department of Defense2100: DEPT OF THE ARMY538689
9700: Department of Defense1700: DEPT OF THE NAVY481765
7000: HOMELAND SECURITY, DEPARTMENT OF7008: U.S. COAST GUARD419564
1200: AGRICULTURE, DEPARTMENT OF12C2: FOREST SERVICE381536
8000: NATIONAL AERONAUTICS AND SPACE ADMINISTRATION8000: NATIONAL AERONAUTICS AND SPACE ADMINISTRATION346497
3600: Department of Veterans Affairs3600: VETERANS AFFAIRS, DEPARTMENT OF331327
7500: HEALTH AND HUMAN SERVICES, DEPARTMENT OF7529: NATIONAL INSTITUTES OF HEALTH325444
9700: DEPT OF DEFENSE97AK: DEFENSE INFORMATION SYSTEMS AGENCY (DISA)266946

(rows: 20, time: 1.9s, 3GB processed, job: job_aJH44_BXKfXqvsuQUZLf05seXzA)

In [28]:
%%bq query
select
mod_parent,
vendorname,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
group by 1,2
order by 3 DESC
limit 20


Out[28]:
mod_parentvendornamedollarsobligated
LOCKHEED MARTIN CORPORATIONLOCKHEED MARTIN CORPORATION4.06140595405e+11
RAYTHEON COMPANYRAYTHEON COMPANY1.43185541382e+11
THE BOEING COMPANYTHE BOEING COMPANY81604985662.4
THE BOEING COMPANYMCDONNELL DOUGLAS CORPORATION80799250038.5
GENERAL DYNAMICS CORPORATIONELECTRIC BOAT CORPORATION59956826888.2
THE BOEING COMPANYBOEING COMPANY, THE58976360811.5
MCKESSON CORPORATIONMCKESSON CORPORATION51476224729.7
MISCELLANEOUS FOREIGN CONTRACTORSMISCELLANEOUS FOREIGN CONTRACTORS45689431658.1
NORTHROP GRUMMAN CORPORATIONNORTHROP GRUMMAN SYSTEMS CORPORATION43705606171.4
UNITED TECHNOLOGIES CORPORATIONSIKORSKY AIRCRAFT CORPORATION42109220881.9
KBR INC.KELLOGG BROWN AND ROOT SERVICES INCORPORATED (1250)40346039073.6
LOCKHEED MARTIN CORPORATIONSANDIA CORP39244387380.0
UNITED TECHNOLOGIES CORPORATIONUNITED TECHNOLOGIES CORPORATION30902911172.6
GENERAL ELECTRIC COMPANYGENERAL ELECTRIC COMPANY30031810973.2
SAIC INC.SCIENCE APPLICATIONS INTERNATIONAL CORPORATION27233963756.4
L-3 COMMUNICATIONS HOLDINGS INC.L-3 COMMUNICATIONS CORPORATION26627349879.0
COMPUTER SCIENCES CORPORATIONCOMPUTER SCIENCES CORPORATION26373368163.9
LOCKHEED MARTIN CORPORATIONLOCKHEED MARTIN CORPORATION (3632)24906143203.2
CALIFORNIA INSTITUTE OF TECHNOLOGYCALIFORNIA INSTITUTE OF TECHNOLOGY24371979504.1
BELL BOEING JOINT PROJECT OFFICEBELL BOEING JOINT PROJECT OFFICE24325611915.8

(rows: 20, time: 2.3s, 3GB processed, job: job_U0FhvEK_u7QW9IUK6nCYESwtYm0)

Understanding where the budget is spent


In [29]:
%%bq query
select
productorservicecode,
systemequipmentcode,
claimantprogramcode,
principalnaicscode,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
where vendorname in ('LOCKHEED MARTIN CORPORATION')
group by 1,2,3,4
order by 5 DESC
limit 20


Out[29]:
productorservicecodesystemequipmentcodeclaimantprogramcodeprincipalnaicscodedollarsobligated
1510: AIRCRAFT, FIXED WING198: JSF (F-35) A1A: AIRFRAMES AND SPARES336411: AIRCRAFT MANUFACTURING28184206750.3
AC15: DEFENSE AIRCRAFT (OPERATIONAL):A1A: AIRFRAMES AND SPARES336411: AIRCRAFT MANUFACTURING26151446019.5
1510: AIRCRAFT, FIXED WING:A1A: AIRFRAMES AND SPARES336411: AIRCRAFT MANUFACTURING19916460350.7
1510: AIRCRAFT, FIXED WING265: F/A-22 A1A: AIRFRAMES AND SPARES336411: AIRCRAFT MANUFACTURING15905680755.8
1510: AIRCRAFT, FIXED WING198: F-35A1A: AIRFRAMES AND SPARES336411: AIRCRAFT MANUFACTURING14576942151.1
1510: AIRCRAFT, FIXED WING220: C130-J A1A: AIRFRAMES AND SPARES336411: AIRCRAFT MANUFACTURING9329833845.35
AR92: R&D-OTHER SPACE-A RES/EXPL DEV:A2: MISSILE AND SPACE SYSTEMS541710: RESEARCH AND DEVELOPMENT IN THE PHYSICAL, ENGINEERING, AND LIFE SCIENCES7604013454.81
1510: AIRCRAFT, FIXED WING198: F-35A1A: AIRFRAMES AND SPARES3364116729682365.63
AC15: R&D- DEFENSE SYSTEM: AIRCRAFT (OPERATIONAL SYSTEMS DEVELOPMENT):A1A: AIRFRAMES AND SPARES336411: AIRCRAFT MANUFACTURING6320634728.51
1420: GUIDED MISSILE COMPONENTS178: TRIDENT II MISSILE A2: MISSILE AND SPACE SYSTEMS336414: GUIDED MISSILE AND SPACE VEHICLE MANUFACTURING6302806071.0
AC23: R&D-MISSILE & SPACE SYS-ADV DEVCNX: THAAD-THEATER HIGH ALT AIR DEFA2: MISSILE AND SPACE SYSTEMS541710: RESEARCH AND DEVELOPMENT IN THE PHYSICAL, ENGINEERING, AND LIFE SCIENCES5949829201.77
1410: GUIDED MISSILESCAA: MISSILE DEFENSE AGENCY SUPPORTA2: MISSILE AND SPACE SYSTEMS336414: GUIDED MISSILE AND SPACE VEHICLE MANUFACTURING5853241426.12
1510: AIRCRAFT, FIXED WING220: C-130J A1A: AIRFRAMES AND SPARES3364115045122922.06
1810: SPACE VEHICLES210: SBIRS HIGHA2: MISSILE AND SPACE SYSTEMS336414: GUIDED MISSILE AND SPACE VEHICLE MANUFACTURING4607513860.0
1905: COMBAT SHIPS AND LANDING VESSELS374: LCS A3: SHIPS336611: SHIP BUILDING AND REPAIRING4529595335.61
AR33: SPACE FLIGHT (ADVANCED)::541710: RESEARCH AND DEVELOPMENT IN THE PHYSICAL, ENGINEERING, AND LIFE SCIENCES4403909376.94
AR33: R&D- SPACE: FLIGHT (ADVANCED DEVELOPMENT)::541710: RESEARCH AND DEVELOPMENT IN THE PHYSICAL, ENGINEERING, AND LIFE SCIENCES4339559223.83
1420: GUIDED MISSILE COMPONENTS106: PATRIOT A2: MISSILE AND SPACE SYSTEMS336419: OTHER GUIDED MISSILE AND SPACE VEHICLE PARTS AND AUXILIARY EQUIPMENT MANUFACTURING3931051623.01
J016: MAINT/REPAIR/REBUILD OF EQUIPMENT- AIRCRAFT COMPONENTS AND ACCESSORIES265: F/A-22 S1: SERVICES488190: OTHER SUPPORT ACTIVITIES FOR AIR TRANSPORTATION3840693155.95
AC17: R&D-AIRCRAFT-COMERCLIZ265: F/A-22 A1A: AIRFRAMES AND SPARES541710: RESEARCH AND DEVELOPMENT IN THE PHYSICAL, ENGINEERING, AND LIFE SCIENCES3685231281.42

(rows: 20, time: 1.9s, 7GB processed, job: job_x_0PqfgBlhqVEN6w5Ex8rnRRmsE)

In [30]:
%%bq query
select
#mod_parent,
vendorname,
systemequipmentcode,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
where productorservicecode in ('1510: AIRCRAFT, FIXED WING')
group by 1,2
order by 3 DESC
limit 20


Out[30]:
vendornamesystemequipmentcodedollarsobligated
LOCKHEED MARTIN CORPORATION198: JSF (F-35) 28184206750.3
LOCKHEED MARTIN CORPORATION:26614890662.8
MCDONNELL DOUGLAS CORPORATION200: C-17A 22087470030.8
LOCKHEED MARTIN CORPORATION198: F-3521306624516.7
LOCKHEED MARTIN CORPORATION265: F/A-22 19154681175.1
MCDONNELL DOUGLAS CORPORATION549: F/A-18 E/F 18911200956.0
THE BOEING COMPANY:13936870961.3
THE BOEING COMPANY334: P-8A11100547106.4
BOEING COMPANY, THE:10720289251.5
MCDONNELL DOUGLAS CORPORATION:10702937261.0
THE BOEING COMPANY387: KC-46A10075967071.1
LOCKHEED MARTIN CORPORATION220: C130-J 9371996889.05
LOCKHEED MARTIN CORPORATION220: C-130J 8456177123.0
BELL BOEING JOINT PROJECT OFFICE212: V22 6014772197.5
MCDONNELL DOUGLAS CORPORATION,:5627290260.64
THE BOEING COMPANY311: POSEIDON 5562548109.18
LOCKHEED MARTIN CORPORATION (3632)327: C-5 RERP 4954447529.74
BELL BOEING JOINT PROJECT OFFICE212: V-22 4897019529.28
NORTHROP GRUMMAN SYSTEMS CORPORATION364: E-2D AHE4121105384.0
THE BOEING COMPANY334: MMA 4068738438.96

(rows: 20, time: 1.8s, 3GB processed, job: job_AQ7Fn2qUc0_ZVAnxvXFtUhZgntc)

In [32]:
%%bq query
select
vendorname,
systemequipmentcode,
claimantprogramcode,
principalnaicscode,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
where productorservicecode in ('1510: AIRCRAFT, FIXED WING')
and contractingofficerbusinesssizedetermination in ('S: SMALL BUSINESS')
group by 1,2,3,4
order by dollarsobligated DESC
limit 20


Out[32]:
vendornamesystemequipmentcodeclaimantprogramcodeprincipalnaicscodedollarsobligated
SABRELINER CORPORATION3ADH: A1A: AIRFRAMES AND SPARES336413: OTHER AIRCRAFT PARTS AND AUXILIARY EQUIPMENT MANUFACTURING115828265.0
TKC AEROSPACE::541330: ENGINEERING SERVICES57520022.28
MISSION RESEARCH CORPORATION:A1A: AIRFRAMES AND SPARES336413: OTHER AIRCRAFT PARTS AND AUXILIARY EQUIPMENT MANUFACTURING53402855.9
ORION AIR GROUP, LLC:A1C: OTHER AIRCRAFT EQUIPMENT 336411: AIRCRAFT MANUFACTURING50000000.0
AEROVIRONMENT INCORPORATED:A1A: AIRFRAMES AND SPARES336411: AIRCRAFT MANUFACTURING44710062.38
DYNAMIC AVIATION GROUP, INC.:A1A: AIRFRAMES AND SPARES336411: AIRCRAFT MANUFACTURING39218969.0
TKC AEROSPACE INC.::336411: AIRCRAFT MANUFACTURING33869649.97
TKC AEROSPACE::336411: AIRCRAFT MANUFACTURING31932233.4
TKC AEROSPACE, INC.::336411: AIRCRAFT MANUFACTURING31925543.76
DME CORPORATIONINC3000: A1C: OTHER AIRCRAFT EQUIPMENT 336411: AIRCRAFT MANUFACTURING31710250.0
CYMSTAR LLC:A1C: OTHER AIRCRAFT EQUIPMENT 336413: OTHER AIRCRAFT PARTS AND AUXILIARY EQUIPMENT MANUFACTURING31553729.07
PHOENIX HELIPARTS INC.::336411: AIRCRAFT MANUFACTURING24694347.75
SIERRA NEVADA CORPORATION:A1C: OTHER AIRCRAFT EQUIPMENT 336413: OTHER AIRCRAFT PARTS AND AUXILIARY EQUIPMENT MANUFACTURING24145609.0
MISSION RESEARCH CORPORATION3000: A1A: AIRFRAMES AND SPARES336413: OTHER AIRCRAFT PARTS AND AUXILIARY EQUIPMENT MANUFACTURING23700000.0
SABRELINER CORPORATION:A1A: AIRFRAMES AND SPARES336413: OTHER AIRCRAFT PARTS AND AUXILIARY EQUIPMENT MANUFACTURING22307953.0
AIRCRAFT PORTFOLIO MANAGEMENT:A1A: AIRFRAMES AND SPARES336411: AIRCRAFT MANUFACTURING22012000.0
MIDWEST JET CENTER, LLC::481211: NONSCHEDULED CHARTERED PASSENGER AIR TRANSPORTATION17280000.0
FLIGHTSAFETY INTERNATIONAL::611512: FLIGHT TRAINING15534920.0
AIR TRACTOR INCORPORATED::336411: AIRCRAFT MANUFACTURING15490012.28
KIG TECHNICAL SERVICES, LLC:A1A: AIRFRAMES AND SPARES336411: AIRCRAFT MANUFACTURING14998282.59

(rows: 20, time: 1.7s, 8GB processed, job: job_Jxmlr6OBq0IRlMiMMXNZ_GLQAK0)

In [190]:
%%bq query
select
*
from `gpqueries.contracts.raw`
where productorservicecode in ('1510: AIRCRAFT, FIXED WING')
and contractingofficerbusinesssizedetermination in ('S: SMALL BUSINESS')
limit 1


Out[190]:
unique_transaction_idtransaction_statusdollarsobligatedbaseandexercisedoptionsvaluebaseandalloptionsvaluemaj_agency_catmod_agencymaj_fund_agency_catcontractingofficeagencyidcontractingofficeidfundingrequestingagencyidfundingrequestingofficeidfundedbyforeignentitysigneddateeffectivedatecurrentcompletiondateultimatecompletiondatelastdatetoordercontractactiontypereasonformodificationtypeofcontractpricingpriceevaluationpercentdifferencesubcontractplanlettercontractmultiyearcontractperformancebasedservicecontractmajorprogramcodecontingencyhumanitarianpeacekeepingoperationcontractfinancingcostorpricingdatacostaccountingstandardsclausedescriptionofcontractrequirementpurchasecardaspaymentmethodnumberofactionsnationalinterestactioncodeprogsourceagencyprogsourceaccountprogsourcesubacctaccount_titlerec_flagtypeofidcmultipleorsingleawardidcprogramacronymvendornamevendoralternatenamevendorlegalorganizationnamevendordoingasbusinessnamedivisionnamedivisionnumberorofficecodevendorenabledvendorlocationdisableflagccrexceptionstreetaddressstreetaddress2streetaddress3citystatezipcodevendorcountrycodevendor_state_codevendor_cdcongressionaldistrictvendorsitecodevendoralternatesitecodedunsnumberparentdunsnumberphonenofaxnoregistrationdaterenewaldatemod_parentlocationcodestatecodePlaceofPerformanceCitypop_state_codeplaceofperformancecountrycodeplaceofperformancezipcodepop_cdplaceofperformancecongressionaldistrictpsc_catproductorservicecodesystemequipmentcodeclaimantprogramcodeprincipalnaicscodeinformationtechnologycommercialitemcategorygfe_gfpuseofepadesignatedproductsrecoveredmaterialclausesseatransportationcontractbundlingconsolidatedcontractcountryoforiginplaceofmanufacturemanufacturingorganizationtypeagencyidpiidmodnumbertransactionnumberfiscal_yearidvagencyididvpiididvmodificationnumbersolicitationidextentcompetedreasonnotcompetednumberofoffersreceivedcommercialitemacquisitionprocedurescommercialitemtestprogramsmallbusinesscompetitivenessdemonstrationprograma76actioncompetitiveproceduressolicitationprocedurestypeofsetasidelocalareasetasideevaluatedpreferencefedbizoppsresearchstatutoryexceptiontofairopportunityorganizationaltypenumberofemployeesannualrevenuefirm8aflaghubzoneflagsdbflagissbacertifiedsmalldisadvantagedbusinessshelteredworkshopflaghbcuflageducationalinstitutionflagwomenownedflagveteranownedflagsrdvobflaglocalgovernmentflagminorityinstitutionflagaiobflagstategovernmentflagfederalgovernmentflagminorityownedbusinessflagapaobflagtribalgovernmentflagbaobflagnaobflagsaaobflagnonprofitorganizationflagisothernotforprofitorganizationisforprofitorganizationisfoundationhaobflagishispanicservicinginstitutionemergingsmallbusinessflaghospitalflagcontractingofficerbusinesssizedeterminationis1862landgrantcollegeis1890landgrantcollegeis1994landgrantcollegeisveterinarycollegeisveterinaryhospitalisprivateuniversityorcollegeisschoolofforestryisstatecontrolledinstitutionofhigherlearningisserviceproviderreceivescontractsreceivesgrantsreceivescontractsandgrantsisairportauthorityiscouncilofgovernmentsishousingauthoritiespublicortribalisinterstateentityisplanningcommissionisportauthorityistransitauthorityissubchapterscorporationislimitedliabilitycorporationisforeignownedandlocatedisarchitectureandengineeringisdotcertifieddisadvantagedbusinessenterpriseiscitylocalgovernmentiscommunitydevelopedcorporationownedfirmiscommunitydevelopmentcorporationisconstructionfirmismanufacturerofgoodsiscorporateentitynottaxexemptiscountylocalgovernmentisdomesticshelterisfederalgovernmentagencyisfederallyfundedresearchanddevelopmentcorpisforeigngovernmentisindiantribeisintermunicipallocalgovernmentisinternationalorganizationislaborsurplusareafirmislocalgovernmentownedismunicipalitylocalgovernmentisnativehawaiianownedorganizationorfirmisotherbusinessororganizationisotherminorityownedispartnershiporlimitedliabilitypartnershipisschooldistrictlocalgovernmentissmallagriculturalcooperativeissoleproprietorshipistownshiplocalgovernmentistriballyownedfirmistribalcollegeisalaskannativeownedcorporationorfirmiscorporateentitytaxexemptiswomenownedsmallbusinessisecondisadvwomenownedsmallbusinessisjointventurewomenownedsmallbusinessisjointventureecondisadvwomenownedsmallbusinesswalshhealyactservicecontractactdavisbaconactclingercohenactotherstatutoryauthorityprime_awardee_executive1prime_awardee_executive1_compensationprime_awardee_executive2prime_awardee_executive2_compensationprime_awardee_executive3prime_awardee_executive3_compensationprime_awardee_executive4prime_awardee_executive4_compensationprime_awardee_executive5prime_awardee_executive5_compensationinteragencycontractingauthoritylast_modified_date
1890f27bde114a80357a916d3212edb6active166000.00.00.09514: OCCUPATIONAL SAFETY AND HEALTH REVIEW COMMISSION1226: FOREST SERVICE9514: OCCUPATIONAL SAFETY AND HEALTH REVIEW COMMISSION1226: FOREST SERVICE091S8: DEPT OF AGRIC/FOREST SERVICE1226: FOREST SERVICE:Not Applicable03/15/200103/15/200103/15/200303/15/200301/01/1900DCA Definitive Contract:K: Fixed Price with Economic Price Adjustment0.0B: PLAN NOT REQUIRED XN: NoX: NOT APPLICABLE:::XN: No1NONE: NONE:COMMANDER NORTHWEST LIMITED200 AIRPORT WAYEAST WENATCHEEWA988028746UNITED STATESWAWA04WA0401223047501223047501/01/190001/01/1900COMMANDER NORTHWEST LIMITED54652CA: CALIFORNIACA: USA: UNITED STATESCA00CA00151510: AIRCRAFT, FIXED WING::115310: SUPPORT ACTIVITIES FOR FORESTRY:N: Transaction does not use GFE/GFP:::D: NOT A BUNDLED REQUIREMENTN: NoUSA::1226: FOREST SERVICE091S8200103D5591S801AF19002001A: FULL AND OPEN COMPETITION:2A: N: NoN: NONo:::Y00.0NNNNNNNNNNNNNNNNNNNNNNNNNNNNNS: SMALL BUSINESSNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN: NoY: YesX: Not ApplicableN: No0.00.00.00.00.0X: Not Applicable06/27/2001

(rows: 1, time: 2.1s, 83GB processed, job: job_huiMES8smOVDiBF3FFl2ocZ8ZP0)

In [33]:
%%bq query
select 
claimantprogramcode,
principalnaicscode,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
where contractingofficerbusinesssizedetermination in ("S: SMALL BUSINESS")
group by 1,2
order by dollarsobligated DESC
limit 10


Out[33]:
claimantprogramcodeprincipalnaicscodedollarsobligated
C2: CONSTRUCTION236220: COMMERCIAL AND INSTITUTIONAL BUILDING CONSTRUCTION60403437335.8
:541519: OTHER COMPUTER RELATED SERVICES43206577283.3
S1: SERVICES541330: ENGINEERING SERVICES42325981781.3
:236220: COMMERCIAL AND INSTITUTIONAL BUILDING CONSTRUCTION30723658597.9
:541512: COMPUTER SYSTEMS DESIGN SERVICES23153068951.3
:541611: ADMINISTRATIVE MANAGEMENT AND GENERAL MANAGEMENT CONSULTING SERVICES23053212686.1
S1: SERVICES561210: FACILITIES SUPPORT SERVICES22290451538.3
:541511: CUSTOM COMPUTER PROGRAMMING SERVICES18313101434.9
C2: CONSTRUCTION237990: OTHER HEAVY AND CIVIL ENGINEERING CONSTRUCTION17921409986.7
::17431988986.3

(rows: 10, time: 2.0s, 4GB processed, job: job_Gx2HLp18ACGjNTN_6tPfJoPzMFk)

Looking at SMBs by year


In [34]:
%%bq query -n df_query
select 
fiscal_year,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
where contractingofficerbusinesssizedetermination in ("S: SMALL BUSINESS")
group by 1
order by 1

In [35]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
plt = df.set_index('fiscal_year')['dollarsobligated'].plot(kind='bar', title='transactions amount for SMBs')



In [36]:
%%bq query -n df_query
#%%sql
select 
smb.fiscal_year,
sum(smb.transaction) as smb,
sum(total.transaction) as total,
sum(smb.transaction)/sum(total.transaction) as percentage
from
(select 
  fiscal_year,
  sum(dollarsobligated) as transaction
  from `fiery-set-171213.vrec.usa_spending_all`
  where contractingofficerbusinesssizedetermination in ("S: SMALL BUSINESS")
  group by 1) as smb
join 
(select
  fiscal_year,
  sum(dollarsobligated) as transaction
  from `fiery-set-171213.vrec.usa_spending_all`
  group by 1) as total
on smb.fiscal_year = total.fiscal_year
group by 1
order by 1

In [37]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
plt = df.set_index('fiscal_year')['percentage'].plot(kind='bar', title='dollars % for SMBs')


SMB contract by gov. agency & by naics code


In [38]:
%%bq query
select 
smb.principalnaicscode as principalnaicscode,
sum(total.count) as count,
sum(smb.dollarsobligated) as dollarsobligated_smb,
sum(total.dollarsobligated) as dollarsobligated_total,
sum(smb.dollarsobligated)/sum(total.dollarsobligated) as smb_percentage
from
(select 
  principalnaicscode,
  count(*) as count,
  sum(dollarsobligated) as dollarsobligated
  from `fiery-set-171213.vrec.usa_spending_all`
  where contractingofficerbusinesssizedetermination in ("S: SMALL BUSINESS")
  group by 1) as smb
join 
(select
  principalnaicscode,
  count(*) as count,
  sum(dollarsobligated) as dollarsobligated
  from `fiery-set-171213.vrec.usa_spending_all`
  group by 1
  having dollarsobligated > 0
) as total
on smb.principalnaicscode = total.principalnaicscode
group by 1
order by 5 DESC
limit 10


Out[38]:
principalnaicscodecountdollarsobligated_smbdollarsobligated_totalsmb_percentage
111334: BERRY (EXCEPT STRAWBERRY) FARMING4983928.07277.1311.5331181386
5181114308270781.41150208.477.19068031207
3231155795535.4214640.236.52554092388
421210857041.9133780.441.68860766763
53131137916838508.1611813329.961.42538202328
337125231719004.07551770.311.30308582569
331411: PRIMARY SMELTING AND REFINING OF COPPER29830042581.8823504524.941.27816162874
3334121011009259.61830330.811.21549098003
33341143513352294.7612155472.221.0984595677
11232017106395.199085.51.07377063243

(rows: 10, time: 2.2s, 3GB processed, job: job_Rpl-wBfG4ITaor0UqqBStTD38Ts)

Simple Linear regression (LR)

LR: predict the size of the contract

  • A lot of categorical feature -> needs to binarize it -> creates very sparse Matrix -> poor performance for LR
  • R square of to 2%
  • Not ideal for the problem we are trying for here

In [176]:
%%bq query -n df_query
select
maj_agency_cat,
#mod_agency,
#contractactiontype,
#typeofcontractpricing,
#performancebasedservicecontract,
state,
#vendorcountrycode,
#principalnaicscode,
contractingofficerbusinesssizedetermination,
#sum(dollarsobligated) as dollarsobligated
dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
where vendorcountrycode in ('UNITED STATES', 'USA: UNITED STATES OF AMERICA')
and contractingofficerbusinesssizedetermination in ('O: OTHER THAN SMALL BUSINESS', 'S: SMALL BUSINESS')
and dollarsobligated > 0
#group by 1,2,3
limit 20000

In [177]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
df.head()


Out[177]:
maj_agency_cat state contractingofficerbusinesssizedetermination dollarsobligated
0 3600: VETERANS AFFAIRS, DEPARTMENT OF AL S: SMALL BUSINESS 2971.20
1 3600: VETERANS AFFAIRS, DEPARTMENT OF AL S: SMALL BUSINESS 700000.00
2 1400: INTERIOR, DEPARTMENT OF THE AR S: SMALL BUSINESS 35549.00
3 4700: GENERAL SERVICES ADMINISTRATION AZ S: SMALL BUSINESS 43.72
4 3600: VETERANS AFFAIRS, DEPARTMENT OF AZ S: SMALL BUSINESS 40504.63

In [178]:
# Create dummy variable using pandas function get_dummies
df1 = df.join(pd.get_dummies(df['maj_agency_cat']))
df1 = df1.join(pd.get_dummies(df['state']))
df1 = df1.join(pd.get_dummies(df['contractingofficerbusinesssizedetermination']))

df1 = df1.drop('maj_agency_cat', axis = 1)
df1 = df1.drop('state', axis = 1)
df1 = df1.drop('contractingofficerbusinesssizedetermination', axis = 1)
df1.head()


Out[178]:
dollarsobligated 1100: EXECUTIVE OFFICE OF THE PRESIDENT 1145: PEACE CORPS 1153: UNITED STATES TRADE AND DEVELOPMENT AGENCY 1200: AGRICULTURE, DEPARTMENT OF 1300: COMMERCE, DEPARTMENT OF 1300: Department of Commerce 1400: Department of the Interior 1400: INTERIOR, DEPARTMENT OF THE 1500: Department of Justice ... VI VT WA WI WV WY hi ma O: OTHER THAN SMALL BUSINESS S: SMALL BUSINESS
0 2971.20 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
1 700000.00 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
2 35549.00 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 1
3 43.72 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
4 40504.63 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 137 columns


In [179]:
train_data = df1.iloc[:,1:]
train_labels = df[['dollarsobligated']]

lm = LinearRegression()
lm.fit(train_data, train_labels)

# The coefficients
print('Coefficients: \n', lm.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((lm.predict(train_data) - train_labels) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % lm.score(train_data, train_labels))


('Coefficients: \n', array([[  6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,   6.98552231e+17,
          6.98552231e+17,   6.98552231e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -3.98845009e+17,
         -3.98845009e+17,  -3.98845009e+17,  -2.63181572e+18,
         -2.63181572e+18]]))
Mean squared error: 2435668194841.60
Variance score: 0.02

LR: Predict the number of contracts (popularity)

  • Same issue than previously
  • R square of to 1%
  • Not ideal for the problem we are trying for here

In [182]:
%%bq query -n df_query
select
vendorname,
maj_agency_cat,
state,
contractingofficerbusinesssizedetermination,
count(*) as count,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
where vendorcountrycode in ('UNITED STATES', 'USA: UNITED STATES OF AMERICA')
and contractingofficerbusinesssizedetermination in ('O: OTHER THAN SMALL BUSINESS', 'S: SMALL BUSINESS')
and dollarsobligated > 0
group by 1,2,3,4
limit 20000

In [183]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
df.head()


Out[183]:
vendorname maj_agency_cat state contractingofficerbusinesssizedetermination count dollarsobligated
0 THE HEAT CENTER 1500: JUSTICE, DEPARTMENT OF S: SMALL BUSINESS 1 1181.00
1 NEW HORIZONS TELECOM INC 6900: TRANSPORTATION, DEPARTMENT OF AK S: SMALL BUSINESS 10 1017093.00
2 D AND L WOODWORKS 1400: INTERIOR, DEPARTMENT OF THE AK S: SMALL BUSINESS 1 34520.00
3 BRISTOL CONSTRUCTION SERVICES, 9700: DEPT OF DEFENSE AK S: SMALL BUSINESS 46 23886665.50
4 GRANT AVIATION, INC. 1400: INTERIOR, DEPARTMENT OF THE AK S: SMALL BUSINESS 2 14857.08

In [185]:
#Create dummy variable using pandas function get_dummies
df1 = df.join(pd.get_dummies(df['maj_agency_cat']))
df1 = df1.join(pd.get_dummies(df['state']))
df1 = df1.join(pd.get_dummies(df['contractingofficerbusinesssizedetermination']))

df1 = df1.drop('maj_agency_cat', axis = 1)
df1 = df1.drop('state', axis = 1)
df1 = df1.drop('contractingofficerbusinesssizedetermination', axis = 1)
df1 = df1.drop('vendorname', axis = 1)
df1 = df1.drop('dollarsobligated', axis = 1)

train_data = df1.iloc[:,1:]
train_labels = df[['count']]

lm = LinearRegression()
lm.fit(train_data, train_labels)

# The coefficients
print('Coefficients: \n', lm.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((lm.predict(train_data) - train_labels) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % lm.score(train_data, train_labels))


('Coefficients: \n', array([[  1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101469e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101469e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101471e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101469e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101469e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101469e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   1.40101470e+10,
          1.40101470e+10,   1.40101470e+10,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   7.28725805e+13,
          7.28725805e+13,   7.28725805e+13,   1.46975561e+14,
          1.46975561e+14]]))
Mean squared error: 115230.01
Variance score: 0.01

MVP

  • Search query = 'construction'
  • Enter your department name - eg. 'agriculture'
  • Ranking based on 'counts' of number of contracts that occured

  • TO-DO check the uppercase and lowercase in the REGEX

  • Do we want to add more parameters, such as Geo, size of the contract? To be discussed

In [65]:
%%bq query
select
#principalnaicscode,
fiscal_year,
maj_agency_cat,
#contractingofficerbusinesssizedetermination,
#vendorname,
productorservicecode,
count(*) as count,
sum(dollarsobligated) as dollarsobligated
from `fiery-set-171213.vrec.usa_spending_all`
#where contractingofficerbusinesssizedetermination in ("S: SMALL BUSINESS")
#where regexp_contains(principalnaicscode, "CONSTRUCTION")
#and regexp_contains(maj_agency_cat, "AGRICULTURE")
where regexp_contains(productorservicecode, "MEAT")
#and fiscal_year = 2016
group by 1,2,3
order by dollarsobligated DESC
limit 10


Out[65]:
fiscal_yearmaj_agency_catproductorservicecodecountdollarsobligated
20129700: DEPT OF DEFENSE8905: MEAT, POULTRY, AND FISH38811157422202.41
20069700: DEPT OF DEFENSE8905: MEAT, POULTRY, AND FISH84961085248556.57
20119700: DEPT OF DEFENSE8905: MEAT, POULTRY, AND FISH82031020507916.95
20121200: AGRICULTURE, DEPARTMENT OF8905: MEAT, POULTRY, AND FISH816854972440.02
20091200: AGRICULTURE, DEPARTMENT OF8905: MEAT, POULTRY, AND FISH832847556584.56
20131200: AGRICULTURE, DEPARTMENT OF8905: MEAT, POULTRY, AND FISH909837261896.74
20111200: AGRICULTURE, DEPARTMENT OF8905: MEAT, POULTRY, AND FISH719828393329.71
20161200: Department of Agriculture8905: MEAT, POULTRY, AND FISH544818812458.77
20101200: AGRICULTURE, DEPARTMENT OF8905: MEAT, POULTRY, AND FISH883814609750.75
20141200: AGRICULTURE, DEPARTMENT OF8905: MEAT, POULTRY, AND FISH518794157517.17

(rows: 10, time: 1.9s, 4GB processed, job: job_pyuYsoZOmedah1uGNKd5ZFDmZhw)

MVP 2 - Collaborative filtering

  • If a person A likes item 1, 2, 3 and B like 2,3,4 then they have similar interests and A should like item 4 and B should like item 1.
  • Looking at match between gov mod_agency (275) & vendors (770526)
  • See: https://cambridgespark.com/content/tutorials/implementing-your-own-recommender-systems-in-Python/index.html
  • TO DO - TRAINING (1.9M rows) kernel crashed abover 20K -> Need to Map/Reduce or getting a higher performance machine or use another algorithm (matrix factorization)?
  • TO DO - Think about scaling or binarizing the count data -> to improve results
  • TO DO - Look at match between product service code (5833) & vendors (770526)
  • TO DO - Add Geo filter?
  • TO DO - Already done business with a company?

In [2]:
%%bq query -n df_query
select
contractingofficerbusinesssizedetermination,
mod_agency,
vendorname,
count(*) as count
from `fiery-set-171213.vrec.usa_spending_all`
where vendorcountrycode in ('UNITED STATES', 'USA: UNITED STATES OF AMERICA')
and contractingofficerbusinesssizedetermination in ('O: OTHER THAN SMALL BUSINESS', 'S: SMALL BUSINESS')
and mod_agency not in ("")
group by 1,2,3
order by count DESC
limit 20000

In [3]:
df = df_query.execute(output_options=bq.QueryOutput.dataframe()).result()
df.head()


Out[3]:
contractingofficerbusinesssizedetermination mod_agency vendorname count
0 O: OTHER THAN SMALL BUSINESS 97AS: DEFENSE LOGISTICS AGENCY AMERISOURCEBERGEN DRUG CORPORA 676166
1 O: OTHER THAN SMALL BUSINESS 97AS: DEFENSE LOGISTICS AGENCY CARDINAL HEALTH 200, LLC 419654
2 O: OTHER THAN SMALL BUSINESS 97AS: DEFENSE LOGISTICS AGENCY CARDINAL HEALTH, INC 318861
3 O: OTHER THAN SMALL BUSINESS 4730: FEDERAL ACQUISITION SERVICE NATIONAL INDUSTRIES FOR THE BLIND 308991
4 O: OTHER THAN SMALL BUSINESS 97AS: DEFENSE LOGISTICS AGENCY OWENS & MINOR DISTRIBUTION, INC. 262205

In [4]:
df1 = df.drop('contractingofficerbusinesssizedetermination', axis = 1)
n_agency = df1.mod_agency.unique().shape[0]
n_vendors = df1.vendorname.unique().shape[0]
print 'Number of gov agency = ' + str(n_agency) + ' | Number of vendors = ' + str(n_vendors)


Number of gov agency = 136 | Number of vendors = 14221

In [5]:
# Convert categorial value with label encoding

le_agency = LabelEncoder()
label_agency = le_agency.fit_transform(df1['mod_agency'])

le_vendor = LabelEncoder()
label_vendor = le_vendor.fit_transform(df1['vendorname'])

df_agency = pd.DataFrame(label_agency)
df_vendor = pd.DataFrame(label_vendor)

df2 = pd.concat([df_agency, df_vendor], axis = 1)
df2 = pd.concat([df2, df1['count']], axis = 1)
df2.columns = ['mod_agency', 'vendorname', 'count']
df2.head(5)

# To ge the right label back
# le_agency.inverse_transform([173, 100])


Out[5]:
mod_agency vendorname count
0 124 1025 676166
1 124 2557 419654
2 124 2568 318861
3 67 8917 308991
4 124 9567 262205

In [6]:
# Split into training and test data set
train_data, test_data = cv.train_test_split(df2, test_size=0.25)

In [7]:
#Build the matrix
train_data_matrix = np.zeros((n_agency, n_vendors))
for line in train_data.itertuples():
    train_data_matrix[line[1]-1, line[2]-1] = line[3]

test_data_matrix = np.zeros((n_agency, n_vendors))
for line in test_data.itertuples():
    test_data_matrix[line[1]-1, line[2]-1] = line[3]

#Compute cosine distance
user_similarity = pairwise_distances(train_data_matrix, metric='cosine')
item_similarity = pairwise_distances(train_data_matrix.T, metric='cosine')

def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        #You use np.newaxis so that mean_user_rating has same format as ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis])
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
    return pred
  
item_prediction = predict(train_data_matrix, item_similarity, type='item')
user_prediction = predict(train_data_matrix, user_similarity, type='user')

In [8]:
# Evaluation
def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten()
    ground_truth = ground_truth[ground_truth.nonzero()].flatten() #filter out all items with no 0 as we only want to predict in the test set
    return sqrt(mean_squared_error(prediction, ground_truth))

print 'User-based CF RMSE: ' + str(rmse(user_prediction, test_data_matrix))
print 'Item-based CF RMSE: ' + str(rmse(item_prediction, test_data_matrix))


User-based CF RMSE: 4952.2319678
Item-based CF RMSE: 4968.68787794

Worklow 1


a. Collaborative Filtering - user-item prediction


In [10]:
print 'Worklow 1'
print '=' * 100
print 'Select your agency:'
agency = df1['mod_agency'][10]
print agency
print '=' * 100
print '1. Have you considered working with these SMB companies (user prediction)?'

agency = le_agency.transform(agency)
vendor_reco = pd.DataFrame(user_prediction[agency, :])
labels = pd.DataFrame(le_vendor.inverse_transform(range(0, len(vendor_reco))))

df_reco = pd.concat([vendor_reco, labels], axis = 1)
df_reco.columns = ['reco_score', 'vendorname']
#Join to get the SMB list
df_smb = df.drop(['mod_agency', 'count'], axis = 1)
df_reco = df_reco.set_index('vendorname').join(df_smb.set_index('vendorname'))

df_reco = df_reco.sort_values(['reco_score'], ascending = [0])
df_reco[df_reco['contractingofficerbusinesssizedetermination'] == 'S: SMALL BUSINESS'].head(10)


Worklow 1
====================================================================================================
Select your agency:
2100: DEPT OF THE ARMY
====================================================================================================
1. Have you considered working with these SMB companies (user prediction?
Out[10]:
reco_score contractingofficerbusinesssizedetermination
vendorname
MIRATEK CORPORATION 1899.687467 S: SMALL BUSINESS
KAUFMAN COMPANY INC 1249.384436 S: SMALL BUSINESS
KAUFMAN COMPANY INC 1249.384436 S: SMALL BUSINESS
BOH ENVIRONMENTAL, LLC 1154.652817 S: SMALL BUSINESS
OFFICE DEPOT INCORPORATED 982.513064 S: SMALL BUSINESS
SUPPLYCORE INC 799.510650 S: SMALL BUSINESS
GENERAL MARINE PRODUCTS INC 763.104537 S: SMALL BUSINESS
KIPPER TECHNOLOGIES, INC 664.156616 S: SMALL BUSINESS
XENIA MANAGEMENT CORPORATION 618.360854 S: SMALL BUSINESS
FORD ENGLAND INC 608.005796 S: SMALL BUSINESS

b. Collaborative Filtering - item-item prediction


In [34]:
print '=' * 100
print '2. Have you considered working with these SMB companies (item-item prediction?)'
vendor_reco = pd.DataFrame(item_prediction[agency, :])

df_reco = pd.concat([vendor_reco, labels], axis = 1)
df_reco.columns = ['reco_score', 'vendorname']
df_reco = df_reco.set_index('vendorname').join(df_smb.set_index('vendorname'))
df_reco = df_reco.sort_values(['reco_score'], ascending = [0])
df_reco[df_reco['contractingofficerbusinesssizedetermination'] == 'S: SMALL BUSINESS'].head(10)


====================================================================================================
2. Have you considered working with these SMB companies (item-item prediction?
Out[34]:
reco_score contractingofficerbusinesssizedetermination
vendorname
HOLMAN CONSTRUCTION & GENERAL 4.501594 S: SMALL BUSINESS
ZOLL LIFECOR CORPORATION 4.413256 S: SMALL BUSINESS
GENERAL ELECTRIC CAPITAL CORPORATION 4.372654 S: SMALL BUSINESS
DENTRUST DENTAL INTERNATIONAL, INC. 4.368289 S: SMALL BUSINESS
VION CORPORATION 4.368187 S: SMALL BUSINESS
VION CORPORATION 4.368187 S: SMALL BUSINESS
PHOENIX SYSTEMS ENGINEERING, INC. 4.364277 S: SMALL BUSINESS
AGBAYANI CONST CORP 4.353899 S: SMALL BUSINESS
AGBAYANI CONST CORP 4.353899 S: SMALL BUSINESS
B & G MANUFACTURING CO., INC. 4.352746 S: SMALL BUSINESS

In [35]:
print 'Worklow 2'
print '=' * 100
print 'Select a vendor:'

# Workflow 2  - WIP
# Select a vendor
# Other similar vendor


Worklow 2
====================================================================================================
Select a vendor:

OTHERS - FROM TUTORIAL - Anton Tarasenko

Data Mining Government Clients

Suppose you want to start selling to the government. While FBO.gov publishes government RFPs and you can apply there, government agencies often issue requests when they've already chosen the supplier. Agencies go through FBO.gov because it's a mandatory step for deals north of $25K. But winning at this stage is unlikely if an RFP is already tailored for another supplier.

Reaching warm leads in advance would increase chances of winning a government contract. The contracts data helps identify the warm leads by looking at purchases in the previous years.

There're several ways of searching through those years.

Who Buys What You Make

The goods and services bought in each transaction are encoded in the variable productorservicecode. Top ten product categories according to this variable:


In [6]:
%%sql
select
  substr(productorservicecode, 1, 4) product_id,
  first(substr(productorservicecode, 7)) product_name,
  count(*) transactions,
  sum(dollarsobligated) sum_dollarsobligated
from
  gpqueries:contracts.raw
group by
  product_id
order by
  sum_dollarsobligated desc
limit 10


Out[6]:
product_idproduct_nametransactionssum_dollarsobligated
1510AIRCRAFT, FIXED WING343732.84778910095e+11
M181OPER OF GOVT R&D GOCO FACILITIES117052.09034830695e+11
R425ENGINEERING AND TECHNICAL SERVICES5401282.01977381246e+11
D399OTHER ADP & TELECOMMUNICATIONS SVCS5440691.79749708292e+11
R499OTHER PROFESSIONAL SERVICES8217881.63569940538e+11
9130LIQUID PROPELLANTS -PETROLEUM BASE1932671.47092818248e+11
Q201GENERAL HEALTH CARE SERVICES1571881.43893967601e+11
6505DRUGS AND BIOLOGICALS18572441.40279701467e+11
R706LOGISTICS SUPPORT SERVICES1472381.17202582504e+11
R408PROGRAM MANAGEMENT/SUPPORT SERVICES27090489610029462.5

(rows: 10, time: 2.1s, 2GB processed, job: job_KYbYjkFzkQPN9w-IUjwnrxZ9nbM)

You can find agencies that buy products like yours. If it's "software":


In [7]:
%%sql
select
  substr(agencyid, 1, 4) agency_id,
  first(substr(agencyid, 7)) agency_name,
  count(*) transactions,
  sum(dollarsobligated) sum_dollarsobligated
from
  gpqueries:contracts.raw
where
  productorservicecode contains 'software'
group by
  agency_id
order by
  sum_dollarsobligated desc
ignore case


Out[7]:
agency_idagency_nametransactionssum_dollarsobligated
9700DEPT OF DEFENSE24501531756394755.4
3600VETERANS AFFAIRS, DEPARTMENT OF212233213678011.54
2050INTERNAL REVENUE SERVICE132922625458725.97
2800SOCIAL SECURITY ADMINISTRATION70382383445352.45
4735FEDERAL ACQUISITION SERVICE150642374494872.16
7014U.S. CUSTOMS AND BORDER PROTECTION34561816831446.67
1900STATE, DEPARTMENT OF149641395765461.73
1549FEDERAL BUREAU OF INVESTIGATION46731104923392.94
7001OFFICE OF PROCUREMENT OPERATIONS26041076395157.89
1501OFFICES, BOARDS AND DIVISIONS46171057842936.58
1205USDA, OFFICE OF THE CHIEF FINANCIAL OFFICER7812943296547.34
7529NATIONAL INSTITUTES OF HEALTH8434829980910.42
6800ENVIRONMENTAL PROTECTION AGENCY8291790956089.3
4732FEDERAL ACQUISITION SERVICE3335766033684.66
7013TRANSPORTATION SECURITY ADMINISTRATION2182740364381.42
1406OFFICE OF POLICY, MANAGEMENT, AND BUDGET5080724577419.11
1323BUREAU OF THE CENSUS3101591701336.21
2036BUREAU OF THE FISCAL SERVICE7385582297546.58
1330NATIONAL OCEANIC AND ATMOSPHERIC ADMINISTRATION10758578208631.36
7008U.S. COAST GUARD4265574943946.72
6920FEDERAL AVIATION ADMINISTRATION1491534520819.72
7530CENTERS FOR MEDICARE AND MEDICAID SERVICES926519913926.84
1344PATENT AND TRADEMARK OFFICE3382443011847.34
7012U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT2989441316095.41
8000NATIONAL AERONAUTICS AND SPACE ADMINISTRATION10679430834604.06

(rows: 206, time: 1.6s, 3GB processed, job: job_QM68bqLsL3fjNNEqTxSdcWru7AE)

What Firms in Your Industry Sell to the Government

Another way to find customers is the variable principalnaicscode that encodes the industry in which the vendor does business.

The list of NAICS codes is available at Census.gov, but you can do text search in the table. Let's find who bought software from distributors in 2015:


In [6]:
%%sql
select
  substr(agencyid, 1, 4) agency_id,
  first(substr(agencyid, 7)) agency_name,
  substr(principalnaicscode, 1, 6) naics_id,
  first(substr(principalnaicscode, 9)) naics_name,
  count(*) transactions,
  sum(dollarsobligated) sum_dollarsobligated
from
  gpqueries:contracts.raw
where
  principalnaicscode contains 'software' and
  fiscal_year = 2015
group by
  agency_id, naics_id
order by
  sum_dollarsobligated desc
ignore case


Out[6]:
agency_idagency_namenaics_idnaics_nametransactionssum_dollarsobligated
9700DEPT OF DEFENSE511210SOFTWARE PUBLISHERS79071086588808.06
9700DEPT OF DEFENSE443120COMPUTER AND SOFTWARE STORES5313451131013.69
9700DEPT OF DEFENSE423430COMPUTER AND COMPUTER PERIPHERAL EQUIPMENT AND SOFTWARE MERCHANT WHOLESALERS2073246620960.81
7014U.S. CUSTOMS AND BORDER PROTECTION443120COMPUTER AND SOFTWARE STORES294195178029.07
2800SOCIAL SECURITY ADMINISTRATION443120COMPUTER AND SOFTWARE STORES201165682384.14
2050INTERNAL REVENUE SERVICE443120COMPUTER AND SOFTWARE STORES272105303702.28
1549FEDERAL BUREAU OF INVESTIGATION443120COMPUTER AND SOFTWARE STORES14794253371.19
1900STATE, DEPARTMENT OF443120COMPUTER AND SOFTWARE STORES119691138323.46
2800SOCIAL SECURITY ADMINISTRATION511210SOFTWARE PUBLISHERS24489509652.8
3600VETERANS AFFAIRS, DEPARTMENT OF511210SOFTWARE PUBLISHERS63087867702.88
7001OFFICE OF PROCUREMENT OPERATIONS443120COMPUTER AND SOFTWARE STORES11585158432.43
2050INTERNAL REVENUE SERVICE511210SOFTWARE PUBLISHERS32182585967.45
6920FEDERAL AVIATION ADMINISTRATION423430COMPUTER AND COMPUTER PERIPHERAL EQUIPMENT AND SOFTWARE MERCHANT WHOLESALERS32081678563.94
1501OFFICES, BOARDS AND DIVISIONS443120COMPUTER AND SOFTWARE STORES12475303017.16
7013TRANSPORTATION SECURITY ADMINISTRATION443120COMPUTER AND SOFTWARE STORES13072338222.09
1344PATENT AND TRADEMARK OFFICE443120COMPUTER AND SOFTWARE STORES17960555711.83
1900STATE, DEPARTMENT OF423430COMPUTER AND COMPUTER PERIPHERAL EQUIPMENT AND SOFTWARE MERCHANT WHOLESALERS144959498401.27
9700DEPT OF DEFENSE334614SOFTWARE AND OTHER PRERECORDED COMPACT DISC, TAPE, AND RECORD REPRODUCING 84754775397.52
7529NATIONAL INSTITUTES OF HEALTH443120COMPUTER AND SOFTWARE STORES21254773671.43
1900STATE, DEPARTMENT OF511210SOFTWARE PUBLISHERS35739829177.68
6800ENVIRONMENTAL PROTECTION AGENCY443120COMPUTER AND SOFTWARE STORES26039649773.39
7200AGENCY FOR INTERNATIONAL DEVELOPMENT443120COMPUTER AND SOFTWARE STORES6337272308.46
1205USDA, OFFICE OF THE CHIEF FINANCIAL OFFICER443120COMPUTER AND SOFTWARE STORES26335421398.98
3600VETERANS AFFAIRS, DEPARTMENT OF443120COMPUTER AND SOFTWARE STORES47134401225.47
4732FEDERAL ACQUISITION SERVICE443120COMPUTER AND SOFTWARE STORES75734066865.64

(rows: 463, time: 1.0s, cached, job: job_tyO5qqBhadXXwwH8p2u9qTutzoA)

Inspecting Specific Transactions

You can learn details from looking at transactions for a specific (agency, NAICS) pair. For example, what software does TSA buy?


In [7]:
%%sql
select
  fiscal_year,
  dollarsobligated,
  vendorname, city, state, annualrevenue, numberofemployees,
  descriptionofcontractrequirement
from
  gpqueries:contracts.raw
where
  agencyid contains 'transportation security administration' and
  principalnaicscode contains 'computer and software stores'
ignore case


Out[7]:
fiscal_yeardollarsobligatedvendornamecitystateannualrevenuenumberofemployeesdescriptionofcontractrequirement
200375480.0INTERNATIONAL BUSINESS MACHINEBETHESDAMD0.00
200364800.0AVAYA INCARLINGTONVA0.00
20160.0IMMIXTECHNOLOGY, INC.MCLEANVA31929440.0162CLOSEOUT SERVICE
20085483.8SPECTRUM SYSTEMS, INC.FAIRFAXVA17000000.025IT SECURITY PR TO FUND ANNUAL LICENSE RENEWAL FOR SPIDYNAMICS WEBINSPECT IN THE AMOUNT OF $4,500 PR AMENDED- TO MATCH LOWEST QUOTE AND TO EXTEND SERVICE THROUGH 12/31/2009.
2007557880.0GTSI CORPORATIONCHANTILLYVA900000000.0850THIS PROCUREMENT REQUEST IS FOR 250 D620 TRACE MACHINES FOR THE FSDS PER THE ATTACHED CONFIGURATION. DELIVERY SCHEDULE AS FOLLOWS: DEC 18 - 25 JAN 8 - 25 JAN 15 - 50 JAN 29 - 50 FEB 12 - 50 FEB 26 - 50
20041640913.0INTERNATIONAL BUSINESS MACHINEBETHESDAMD0.00IT SERVICES
201336372.5IMMIXTECHNOLOGY, INC.MCLEANVA31929440.0162THE PURPOSE OF THIS BPA CALL IS TO PURCHASE A QUANTITY OF 250 POE WITH BATTERY BACKUP (PART # 8602806-001)TERMINAL CLOCK PRODUCTS.
20135373123.87INTERNATIONAL BUSINESS MACHINEBETHESDAMD0.00IGF::OT::IGF SUBJECT: APPLICATION SUPPORT TEAM (AST) - OASIS II WORK ORDER
200816077.46SPRINT COMMUNICATIONS COMPANY L.P.RESTONVA38927998976.056001QSEC AIRTIME FOR A BASE AND TWO ADDITIONAL OPTION PERIODS.
2005100000.0ORACLE CORPORATIONRESTONVA0.00AFSP ORACLE SUPPORT
20160.0INTERNATIONAL BUSINESS MACHINEBETHESDAMD0.00IGF::CT::IGF THE PURPOSE OF MODIFICATION P00004 TO OASIS II TASK ORDER #21 (HSTS03-14-J-SAS001) IS TO REALLOCATE AMONG TWO CLIN 10001 LABOR CATEGORIES FOR THE OPTION PERIOD 1.
20160.0INTERNATIONAL BUSINESS MACHINEBETHESDAMD0.00IGF::CT::IGF THIS PURPOSE OF THIS MODIFICATION IS TO DISASSOCIATE FISCAL YEAR 2015/2016 LINE OF ACCOUNTING (LOA), DEOBLIGATE PERVIOUS CLINS AND ADD NEW SUB-CLINS WITH FISCAL YEAR 2016 LINE OF ACCOUNTING (LOA). THIS IS A TIME AND MATERIALS TASK ORDER ISSUED AGAINST BPA HSTS03-13-A-CIO549 TO PROVIDE SUPPORT AND MAINTENANCE FOR THE AUTOMATED TIME AND ATTENDANCE (T&A) SYSTEM FOR THE TRANSPORTATION SECURITY OFFICER STAFF IN ACCORDANCE WITH THE INCLUDED PERFORMANCE WORK STATEMENT AND TERMS AND CONDITIONS.
2016411498.48INTERNATIONAL BUSINESS MACHINEBETHESDAMD0.00IGF::OT::IGF RAPID APPLICATION DEVELOPMENT (RAD) II CONTINUED SUPPORT SERVICES
2005317680.0INTERNATIONAL BUSINESS MACHINEBETHESDAMD0.00
2005413075.0INTERNATIONAL BUSINESS MACHINEBETHESDAMD0.00IT SERVICES
20081834.02TESSCO INCORPORATEDHUNT VALLEYMD513000000.0592MOBILE AMPLIFIERS
200814906.67NETPRO COMPUTING, INC.PHOENIXAZ9731484.088SOFTWARE MAINTENANCE TO PROVIDE FY08 FUNDING IN THE AMOUNT OF $16,500 FOR IT SOFTWARE MAINTENANCE FOR NETPRO FOR THE PERIOD NOV 22, 2007 TO DEC 31, 2008.
20070.0TALX CORPORATIONSAINT LOUISMO45820000.0878TALX - ADDITIONAL FUNDS TO EXTEND TASK ORDER WITH USDA FOR THE TALX UC EXPESS PROGRAM. FUNDS SHALL COVER 4TH QUARTER OF FY06
20073939.74POLYCOM, INCPLEASANTONCA465959000.01271THE PURPOSE OF THIS PROCUREMENT IS TO EXTEND THE POLYCOM PREMIER PLUS SERVICE MAINTENANCE AGREEMENT ON TWO (2) VIDEO CONFERENCING UNITS FOR AN ADDITIONAL YEAR. PRESENT AGREEMENT EXPIRES ON 3-15-07. THE TOTAL COST OF THIS PROCUREMENT IS $3,939.74.
20165000.0INTERNATIONAL BUSINESS MACHINEBETHESDAMD0.00IGF::OT::IGF THE PURPOSE OF MODIFICATION P00003 TO OASIS II TASK ORDER #38 (HSTS03-14-J-SAS005) IS TO ADD CLIN 10007 TO OPTION PERIOD 1 AND CHANGE THE CONTRACTING OFFICER REPRESENTATIVE FROM DEREK SMITH TO SUNG LEE (VALERIE SANDY AS ALTERNATE CONTRACTING OFFICER REPRESENTATIVE)
20099379.91SYNNEX CORPORATIONGREENVILLESC4000000000.0550LCD MONITORS FOR PUBLIC SERVICE ANNOUNCEMENTS AT THE THREE SPOKE AIRPORTS AND TWO MONITORS FOR THE HUB COORDINATING CENTER. 10 IDENTICAL ITEMS WERE PURCHASED FOR THE HUB LAST YEAR UNDER CONTRACT GS-35F-4076D
20090.0GTSI CORPORATIONCHANTILLYVA900000000.0850PURCHASE OF ADOBE LICENSES FOR THE OFFICE OF THE SPECIAL COUNSELOR
2011415522.0GTSI CORPORATIONCHANTILLYVA900000000.0850MAINTENANCE AND SUPPORT FOR SYMANTEC AND ALTIRIS LICENSES.
2006152880.0GTSI CORPORATIONCHANTILLYVA900000000.0850DELL COMPUTERS
20160.0INTERNATIONAL BUSINESS MACHINEBETHESDAMD0.00IGF::CT::IGF ISHARE AND WORKFLOW APPLICATION DEVELOPMENT SERVICES.

(rows: 584, time: 0.7s, cached, job: job_16eFmMJh3axW3zdfW7ioeWdHVE0)

Alternatively, specify vendors your product relates to and check how the government uses it. Top deals in data analytics:


In [8]:
%%sql
select
  agencyid,
  dollarsobligated,
  vendorname,
  descriptionofcontractrequirement
from
  gpqueries:contracts.raw
where
  vendorname contains 'tableau' or
  vendorname contains 'socrata' or
  vendorname contains 'palantir' or
  vendorname contains 'revolution analytics' or
  vendorname contains 'mathworks' or
  vendorname contains 'statacorp' or
  vendorname contains 'mathworks'
order by
  dollarsobligated desc
limit
  100
ignore case


Out[8]:
agencyiddollarsobligatedvendornamedescriptionofcontractrequirement
9700: DEPT OF DEFENSE22401901.14PALANTIR TECHNOLOGIES INCORPORATEDPROCUREMENT OF PALANTIR GOTHAM SOFTWARE AND ASSOCIATED SUPPORT
9700: DEPT OF DEFENSE19243057.44PALANTIR TECHNOLOGIES INCORPORATEDPALANTIR CORE SERVER LICENSES AND HARDWARE
1549: FEDERAL BUREAU OF INVESTIGATION14891225.66PALANTIR TECHNOLOGIES INC.UNLIMITED LICENSE
7012: U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT12472821.2PALANTIR USG, INC.ICE COTS BASED, WEB-ENABLED INVESTIGATIVE CASE MANAGEMENT (ICM) SYSTEM FOR ICE TECS MOD PROGRAM OFFICE. IGF::CT::IGF
9700: DEPT OF DEFENSE11999886.24PALANTIR TECHNOLOGIES INCORPORATEDIGF::OT::IGF EXERCISE OPTION YEAR 1 AND ADD FUNDING
9700: DEPT OF DEFENSE11805943.2PALANTIR TECHNOLOGIES INCORPORATEDPALANTIR SW LICENSE MAINTENANCE SUPPORT RENEWAL.
7012: U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT9900000.0PALANTIR TECHNOLOGIES INC.IGF::OT::IGF FALCON OPERATIONS AND MAINTENANCE (O&M), SYSTEM ENHANCEMENT SUPPORT SERVICES FOR PALANTIR GOVERNMENT
7012: U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT9857193.77PALANTIR USG, INC.ICE COTS BASED, WEB-ENABLED INVESTIGATIVE CASE MANAGEMENT (ICM) SYSTEM FOR ICE TECS MOD PROGRAM OFFICE. IGF::CT::IGF
1501: OFFICES, BOARDS AND DIVISIONS9500000.0PALANTIR TECHNOLOGIES INC.IGF::OT::IGF
9700: DEPT OF DEFENSE9125264.2PALANTIR TECHNOLOGIES INCORPORATEDIFAP
9700: DEPT OF DEFENSE7180568.16PALANTIR TECHNOLOGIES INCORPORATEDPALANTIR SW LICENSE MAINTENANCE SUPPORT RENEWAL.
1501: OFFICES, BOARDS AND DIVISIONS6500000.0PALANTIR TECHNOLOGIES INC.IGF::OT::IGF
5000: SECURITIES AND EXCHANGE COMMISSION5999999.98PALANTIR TECHNOLOGIES INC.IGF::OT::IGF FOR OTHER FUNCTIONS PROVIDE AND IMPLEMENT A COTS SOLUTION FOR THE AGENCY'S ENTERPRISE DATA ANALYTICS
7012: U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT5936000.0PALANTIR TECHNOLOGIES INCORPORATEDPALANTIR SOFTWARE, SERVICE, AND SUPPORT FOR HSI PERSONNEL
9700: DEPT OF DEFENSE5724116.36PALANTIR TECHNOLOGIES INCORPORATEDPALANTIR GOTHAM SERVER LICENSES AND ASSOCIATED FIELD SERVICE REPRESENTATIVE SUPPORT.
9700: DEPT OF DEFENSE5717851.36PALANTIR TECHNOLOGIES INCORPORATEDPALANTIR GOTHAM LICENSES AND FIELD SERVICE REPRESENTATIVE SUPPORT
9700: DEPT OF DEFENSE5646384.24PALANTIR TECHNOLOGIES INCORPORATEDPURCHASE OF PALANTIR SOFTWARE LICENSES&FSR SUPPORT
7012: U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT5038025.59PALANTIR TECHNOLOGIES INC.IGF::OT::IGF FALCON OPERATIONS AND MAINTENANCE (O&M) SUPPORT SERVICES FOR PALANTIR GOVERNMENT, A PROPRIETARY SOFTWARE PRODUCT OF PALANTIR TECHNOLOGIES, INC.
9700: DEPT OF DEFENSE4700000.0PALANTIR TECHNOLOGIES INCORPORATEDNEW WORK MOD
1501: OFFICES, BOARDS AND DIVISIONS4690647.07PALANTIR TECHNOLOGIES INC.IGF::OT::IGF
9700: DEPT OF DEFENSE4661082.38PALANTIR TECHNOLOGIES INCORPORATEDIGF::OT::IGF PALANTIR MARINES IFAP III CORE LICENSES.
1501: OFFICES, BOARDS AND DIVISIONS4579393.0PALANTIR TECHNOLOGIES INC.IGF::OT::IGF,EOUSA PALANTIR
1549: FEDERAL BUREAU OF INVESTIGATION3609995.45PALANTIR TECHNOLOGIES INCORPORATEDCOMPUTER TOOL USAGE LICENSES
9700: DEPT OF DEFENSE3271065.56PALANTIR TECHNOLOGIES INC.IGF::OT::IGF CAG/GMSC FSR SUPPORT
9700: DEPT OF DEFENSE3203190.77PALANTIR TECHNOLOGIES INC.IGF::OT::IGF, TWO (2) 8-CORE PERPETUAL LICENSES

(rows: 100, time: 0.8s, cached, job: job_GHwVcGuA_3SqI6q4FiMMI4k9yqY)

Searching Through Descriptions

Full-text search and regular expressions for the variable descriptionofcontractrequirement narrow results for relevant product groups:


In [9]:
%%sql
select
  agencyid,
  dollarsobligated,
  descriptionofcontractrequirement
from
  gpqueries:contracts.raw
where
  descriptionofcontractrequirement contains 'body camera'
limit
  100
ignore case


Out[9]:
agencyiddollarsobligateddescriptionofcontractrequirement
9700: DEPT OF DEFENSE5529.0WEARABLE BODY CAMERA
1549: FEDERAL BUREAU OF INVESTIGATION3978.0BODY CAMERAS FOR WARM SPRINGS OFFICERS/DETECTIVES
7008: U.S. COAST GUARD5154.21POLICE BODY CAMERAS
1443: NATIONAL PARK SERVICE12044.42CACO WEARABLE BODY CAMERAS FOR THE LAW ENFORCEMENT RANGERS AT CAPE CODE NATIONAL SEASHORE
9700: DEPT OF DEFENSE4315.0NIKON D3 DIGITAL SLR BODY CAMERA
9700: DEPT OF DEFENSE26285.6BODY CAMERA'S
1443: NATIONAL PARK SERVICE14605.4VIEVU BODY CAMERAS
2036: BUREAU OF THE FISCAL SERVICE23970.0BODY CAMERAS
1443: NATIONAL PARK SERVICE14850.0LARO LE BODY CAMERAS
3600: VETERANS AFFAIRS, DEPARTMENT OF5648.016 BODY CAMERAS FOR POLICE DEPT AT LEXINGTON VAMC
9700: DEPT OF DEFENSE48667.5DES BODY CAMERAS
9700: DEPT OF DEFENSE58290.0NIKON D300S BODY CAMERA
1443: NATIONAL PARK SERVICE4855.03MEVE-TASER AXON BODY CAMERAS
9700: DEPT OF DEFENSE0.0POLICE BODY CAMERA SYSTEM
1448: U.S. FISH AND WILDLIFE SERVICE17133.6RLE-15-0065-BODY CAMERAS
3600: VETERANS AFFAIRS, DEPARTMENT OF7375.0PURCHASE BODY CAMERAS FOR POLICE DEPARTMENT
1443: NATIONAL PARK SERVICE8010.0G: LOWTX, PURCHASE FIRSTVU BODY CAMERA(S)
1443: NATIONAL PARK SERVICE0.0X:NOGRN, WEARABLE BODY CAMERAS, BOSTON NATIONAL HISTORICAL PARK, CHARLESTOWN, MA, 02129
9700: DEPT OF DEFENSE12989.34IN-CAR AND BODY CAMERAS FOR POLICE DEPARTMENT
9700: DEPT OF DEFENSE0.0NIKON D300 BODY CAMERA,
7524: FOOD AND DRUG ADMINISTRATION4420.0IGF::OT::IGF BODY CAMERA
1448: U.S. FISH AND WILDLIFE SERVICE68260.25BODY CAMERAS FOR RLE IGF::OT::IGF
7009: U.S. SECRET SERVICE6959.9611-FSD-PR-NIKON D3X BODY CAMERA FOR USE IN THE FORENSICS SERVICES DIVISION.
9700: DEPT OF DEFENSE22564.15POLICE BODY CAMERA SYSTEM
1443: NATIONAL PARK SERVICE9266.3CANY-TASER BODY CAMERAS

(rows: 53, time: 0.9s, cached, job: job_wIUPYZekHe4jwd-oHsy76OSUD4c)

Some rows of descriptionofcontractrequirement contain codes like "IGF::CT::IGF". These codes classify the purchase into three groups of "Inherently Governmental Functions" (IGF):

  1. IGF::CT::IGF for Critical Functions
  2. IGF::CL::IGF for Closely Associated
  3. IGF::OT::IGF for Other Functions

Narrowing Your Geography

You can find local opportunities using variables for vendors (city, state) and services sold (PlaceofPerformanceCity, pop_state_code). The states where most contracts are delivered in:


In [10]:
%%sql
select
  substr(pop_state_code, 1, 2) state_code,
  first(substr(pop_state_code, 4)) state_name,
  sum(dollarsobligated) sum_dollarsobligated
from
  gpqueries:contracts.raw
group by
  state_code
order by
  sum_dollarsobligated desc


Out[10]:
state_codestate_namesum_dollarsobligated
CA 7.12598903911e+11
VA VIRGINIA6.97817723705e+11
:6.67791793514e+11
TX TEXAS5.12968110673e+11
MD MARYLAND3.51494588833e+11
DC DISTRICT OF COLUMBIA2.58102498463e+11
PA PENNSYLVANIA2.10745481833e+11
FL FLORIDA2.10241041129e+11
MA MASSACHUSETTS1.84166810727e+11
AZ ARIZONA1.68027358301e+11
CT CONNECTICUT1.6160664469e+11
MO MISSOURI1.58477274895e+11
NY NEW YORK1.50885801473e+11
WA WASHINGTON1.45630044425e+11
GA GEORGIA1.40264459673e+11
AL ALABAMA1.39065794738e+11
CO COLORADO1.19793072249e+11
IL ILLINOIS1.19266682187e+11
NM NEW MEXICO1.11178508246e+11
TN TENNESSEE1.08883193392e+11
OH OHIO1.07798459749e+11
NJ NEW JERSEY1.06487949719e+11
SC SOUTH CAROLINA87058229568.8
KY KENTUCKY85713398240.4
MI MICHIGAN81600382916.1

(rows: 89, time: 1.5s, cached, job: job_xz8OgI40jd2GPlDk9J5fqZjxTDY)

Facts about Government Contracting

Let's check some popular statements about government contracting.

Small Businesses Win Most Contracts

Contractors had to report their revenue and the number of employees. It makes easy to check if small business is welcomed in government contracting:


In [11]:
%%sql --module gpq

define query vendor_size_by_agency
select
  substr(agencyid, 1, 4) agency_id,
  first(substr(agencyid, 7)) agency_name,
  nth(11, quantiles(annualrevenue, 21)) vendor_median_annualrevenue,
  nth(11, quantiles(numberofemployees, 21)) vendor_median_numberofemployees,
  count(*) transactions,
  sum(dollarsobligated) sum_dollarsobligated
from
  gpqueries:contracts.raw
group by
  agency_id
having
  transactions > 1000 and
  sum_dollarsobligated > 10e6
order by
  vendor_median_annualrevenue asc

In [12]:
bq.Query(gpq.vendor_size_by_agency).to_dataframe()


Out[12]:
agency_id agency_name vendor_median_annualrevenue vendor_median_numberofemployees transactions sum_dollarsobligated
0 1242 ANIMAL AND PLANT HEALTH INSPECTION SERVICE 0.0 0 1268 1.418465e+08
1 1528 IMMIGRATION AND NATURALIZATION SERVICE 0.0 0 4858 1.808390e+09
2 5800 FEDERAL EMERGENCY MANAGEMENT AGENCY 0.0 0 3857 7.360614e+08
3 1900 STATE, DEPARTMENT OF 1000.0 10 906835 9.612514e+10
4 9568 BROADCASTING BOARD OF GOVERNORS 1000.0 1 86580 1.719051e+09
5 7200 AGENCY FOR INTERNATIONAL DEVELOPMENT 1000.0 10 112903 5.683631e+10
6 2800 SOCIAL SECURITY ADMINISTRATION 81000.0 1 140669 1.636968e+10
7 1226 FOREST SERVICE 100000.0 2 24686 1.554254e+09
8 12E3 RURAL HOUSING SERVICE 110000.0 2 87659 1.124532e+09
9 2028 U.S. SECRET SERVICE 110000.0 2 1144 1.870609e+08
10 2001 DEPARTMENTAL OFFICES 128000.0 5 17514 2.408785e+09
11 4500 EQUAL EMPLOYMENT OPPORTUNITY COMMISSION 150000.0 3 32309 7.296316e+08
12 9524 NATIONAL MEDIATION BOARD 150000.0 2 1578 1.539304e+07
13 1560 ATF ACQUISITION AND PROPERTY MGMT DIV 500000.0 7 43268 2.834267e+09
14 1227 SOIL CONSERVATION SERVICE 600000.0 10 1202 2.188730e+08
15 2026 FINANCIAL CRIME ENFORCEMENT NETWORK 700000.0 10 5555 1.951988e+09
16 2022 ALCOHOL AND TOBACCO TAX AND TRADE BUREAU 757000.0 11 1850 3.831141e+08
17 12C2 FOREST SERVICE 850000.0 10 431521 1.722469e+10
18 7520 PUBLIC HEALTH SERVICE 890000.0 11 10775 1.076623e+09
19 6950 UNITED STATES COAST GUARD 1100000.0 15 25750 2.632710e+09
20 1422 BUREAU OF LAND MANAGEMENT 1100000.0 12 173546 4.997679e+09
21 2024 FEDERAL LAW ENFORCEMENT TRAINING CENTER 1100000.0 20 3227 2.085935e+08
22 6100 CONSUMER PRODUCT SAFETY COMMISSION 1100000.0 26 11533 2.786294e+08
23 1438 OFFICE OF SURFACE MINING, RECLAMATION AND ENFO... 1100000.0 10 11108 1.894074e+08
24 1153 UNITED STATES TRADE AND DEVELOPMENT AGENCY 1183519.0 10 2032 1.256019e+08
25 1230 AGRICULTURAL RESEARCH SERVICE 1300000.0 12 3171 3.087970e+08
26 2046 OFFICE OF THE COMPTROLLER OF THE CURRENCY 1400000.0 22 14898 1.668664e+09
27 1603 OFFICE OF THE SOLICITOR 1440000.0 21 1748 5.414278e+07
28 4740 PUBLIC BUILDINGS SERVICE 1573150.0 7 1488711 9.933295e+10
29 12C3 NATURAL RESOURCES CONSERVATION SERVICE 1631681.0 20 44705 1.585386e+09
... ... ... ... ... ... ...
121 2044 UNITED STATES MINT 10000000.0 50 53574 3.109847e+10
122 2700 FEDERAL COMMUNICATIONS COMMISSION 10000000.0 130 9954 1.160409e+09
123 9507 COMMODITY FUTURES TRADING COMMISSION 10500000.0 140 5559 5.247360e+08
124 4705 OFFICE OF THE ADMINISTRATOR(ACMD) 11000000.0 91 5894 1.319038e+09
125 7529 NATIONAL INSTITUTES OF HEALTH 11000000.0 115 401609 6.319064e+10
126 1540 FEDERAL PRISON SYSTEM 11000000.0 70 1019429 3.446292e+10
127 1665 PENSION BENEFIT GUARANTY CORPORATION 11000000.0 200 12647 3.141358e+09
128 7003 U.S. CITIZENSHIP AND IMMIGRATION SERVICES 11000000.0 72 37832 8.801395e+09
129 1344 PATENT AND TRADEMARK OFFICE 11000645.0 88 30012 8.576179e+09
130 3355 NATIONAL GALLERY OF ART 11500000.0 52 4034 3.882365e+08
131 6000 RAILROAD RETIREMENT BOARD 12000000.0 130 5483 2.725028e+08
132 19BM INTERNATIONAL BOUNDARY AND WATER COMMISSION: U... 13209362.0 97 4832 6.288717e+08
133 8961 FEDERAL ENERGY REGULATORY COMMISSION 14996000.0 58 5678 5.596401e+08
134 4704 OFFICE OF INSPECTOR GENERAL 15000000.0 70 1134 3.663729e+07
135 3600 VETERANS AFFAIRS, DEPARTMENT OF 15000000.0 149 5285240 2.059998e+11
136 4732 FEDERAL ACQUISITION SERVICE 15000000.0 57 1413873 2.911922e+10
137 8000 NATIONAL AERONAUTICS AND SPACE ADMINISTRATION 18000000.0 180 446963 2.112742e+11
138 1549 FEDERAL BUREAU OF INVESTIGATION 18000000.0 100 72815 1.711656e+10
139 12D2 FARM SERVICE AGENCY 18000000.0 100 35722 1.874517e+10
140 6940 NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION 19000000.0 120 12252 1.890718e+09
141 6930 FEDERAL RAILROAD ADMINISTRATION 20000000.0 579 12444 8.030557e+08
142 0559 GAO, EXCEPT COMPTROLLER GENERAL 20201548.0 189 2728 4.145613e+08
143 7001 OFFICE OF PROCUREMENT OPERATIONS 20828724.0 383 71728 2.463028e+10
144 7505 OFFICE OF ASSISTANT SECRETARY FOR ADMINISTRATI... 26106972.0 304 5181 1.413397e+10
145 9000 SELECTIVE SERVICE SYSTEM 30457000.0 300 1141 2.356269e+07
146 7528 AGENCY FOR HEALTHCARE RESEARCH AND QUALITY 32800000.0 1200 6719 2.759358e+09
147 1436 BUREAU OF SAFETY AND ENVIRONMENTAL ENFORCEMENT 32850550.0 141 2958 2.827946e+08
148 1524 DRUG ENFORCEMENT ADMINISTRATION 40000000.0 350 261796 6.429977e+09
149 12K2 AGRICULTURAL MARKETING SERVICE 50000000.0 296 22835 1.623015e+10
150 1544 U.S. MARSHALS SERVICE 120000000.0 2000 78441 7.840888e+09

151 rows × 6 columns

The median shows the most likely supplier. Agencies on the top of the table actively employ vendors whose annual revenue is less than $1mn.

The Department of Defence, the largest buyer with $4.5tn worth of goods and services bought over these 17 years, has the median vendor with $2.5mn in revenue and 20 employees. It means that half of the DoD's vendors have less than $2.5mn in revenue.

Set-Aside Deals Take a Small Share

Set-aside purchases are reserved for special categories of suppliers, like women-, minority-, and veteran-owned businesses. There's a lot of confusion about their share in transactions. We can settle this confusion with data:


In [13]:
%%sql
select
  womenownedflag,
  count(*) transactions,
  sum(dollarsobligated) sum_dollarsobligated
from
  gpqueries:contracts.raw
group by
  womenownedflag


Out[13]:
womenownedflagtransactionssum_dollarsobligated
N401919756.4051823538e+12
Y43089662.48395258992e+11

(rows: 2, time: 1.8s, cached, job: job_iOFnq8ieeeHX2gxQWTCCr4ZOEuE)

Women-owned businesses make about one tenth of the transactions, but their share in terms of sales is only 3.7%.

A cross-tabulation for major set-aside categories:


In [14]:
%%sql
select
  womenownedflag, veteranownedflag, minorityownedbusinessflag,
  count(*) transactions,
  sum(dollarsobligated) sum_dollarsobligated
from
  gpqueries:contracts.raw
group by
  womenownedflag, veteranownedflag, minorityownedbusinessflag
order by
  womenownedflag, veteranownedflag, minorityownedbusinessflag desc


Out[14]:
womenownedflagveteranownedflagminorityownedbusinessflagtransactionssum_dollarsobligated
NNY20519782.70617792041e+11
NNN352863335.91438262215e+12
NYY49461761028950046.2
NYN23590471.59152989569e+11
YNY86233680295343639.0
YNN32651631.53867438135e+11
YYY625975086157754.65
YYN1188709146319463.28

(rows: 8, time: 0.6s, cached, job: job_Qh4DhmZZecFIWeF7Wlah0cQwfSM)

For example, firms owned by women, veterans, and minorities (all represented at the same time) sell $5bn in goods and services. That's 0.07% of all government purchases.

New Vendors Emerge Each Year

Becoming a government contractor may seem difficult at first, but let's see how many new contractors the government had in 2015.


In [15]:
%%sql
select
  sum(if(before2015.dunsnumber is null, 1, 0)) new_vendors,
  sum(if(before2015.dunsnumber is null, 0, 1)) old_vendors
from
  flatten((select unique(dunsnumber) dunsnumber from gpqueries:contracts.raw where fiscal_year = 2015), dunsnumber) in2015
left join
  flatten((select unique(dunsnumber) dunsnumber from gpqueries:contracts.raw where fiscal_year < 2015), dunsnumber) before2015
  on before2015.dunsnumber = in2015.dunsnumber


Out[15]:
new_vendorsold_vendors
18258131537

(rows: 1, time: 0.7s, cached, job: job_OjFASl4YsdPFMqmG63vz8lXqgzc)

(The variable dunsnumber refers to the unique number each contractor obtains at SAM.gov. It identifies unique contractors better than vendorname, which varies.)

In 2015, 12% of the suppliers turned out to be companies that had never sold to the government.

More

This was a short intro to government contracting data.

For more, check the Government Procurement Queries project.

Contacts